Problem Statement

Business Context

The prices of the stocks of companies listed under a global exchange are influenced by a variety of factors, with the company's financial performance, innovations and collaborations, and market sentiment being factors that play a significant role. News and media reports can rapidly affect investor perceptions and, consequently, stock prices in the highly competitive financial industry. With the sheer volume of news and opinions from a wide variety of sources, investors and financial analysts often struggle to stay updated and accurately interpret its impact on the market. As a result, investment firms need sophisticated tools to analyze market sentiment and integrate this information into their investment strategies.

Problem Definition

With an ever-rising number of news articles and opinions, an investment startup aims to leverage artificial intelligence to address the challenge of interpreting stock-related news and its impact on stock prices. They have collected historical daily news for a specific company listed under NASDAQ, along with data on its daily stock price and trade volumes.

As a member of the Data Science and AI team in the startup, you have been tasked with analyzing the data, developing an AI-driven sentiment analysis system that will automatically process and analyze news articles to gauge market sentiment, and summarizing the news at a weekly level to enhance the accuracy of their stock price predictions and optimize investment strategies. This will empower their financial analysts with actionable insights, leading to more informed investment decisions and improved client outcomes.

Data Dictionary

  • Date : The date the news was released
  • News : The content of news articles that could potentially affect the company's stock price
  • Open : The stock price (in \$) at the beginning of the day
  • High : The highest stock price (in \$) reached during the day
  • Low : The lowest stock price (in \$) reached during the day
  • Close : The adjusted stock price (in \$) at the end of the day
  • Volume : The number of shares traded during the day
  • Label : The sentiment polarity of the news content
    • 1: positive
    • 0: neutral
    • -1: negative

Please read the instructions carefully before starting the project.

Note: If the free-tier GPU of Google Colab is not accessible (due to unavailability or exhaustion of daily limit or other reasons), the following steps can be taken:

  1. Wait for 12-24 hours until the GPU is accessible again or the daily usage limits are reset.

  2. Switch to a different Google account and resume working on the project from there.

  3. Try using the CPU runtime:

    • To use the CPU runtime, click on Runtime => Change runtime type => CPU => Save
    • One can also click on the Continue without GPU option to switch to a CPU runtime (kindly refer to the snapshot below)
    • The instructions for running the code on the CPU are provided in the respective sections of the notebook.

GPU_Issue.png

Installing and Importing Necessary Libraries

In [1]:
#included this line to elimate the gensim and numpy dependency and version issues
!pip install --upgrade pip -q
In [2]:
# installing the sentence-transformers and gensim libraries for word embeddings
!pip install -U sentence-transformers gensim transformers tqdm -q

restart the session and execute the above 2 lines to eliminate the numpy error dependencies.

In [3]:
# to read and manipulate the data
import pandas as pd
import numpy as np
pd.set_option('max_colwidth', None)    # setting column to the maximum column width as per the data

# to visualise data
import matplotlib.pyplot as plt
import seaborn as sns

# to use regular expressions for manipulating text data
import re

# to load the natural language toolkit
import nltk
nltk.download('stopwords')    # loading the stopwords
nltk.download('wordnet')    # loading the wordnet module that is used in stemming

# to remove common stop words
from nltk.corpus import stopwords

# to perform stemming
from nltk.stem.porter import PorterStemmer

# To encode the target variable
from sklearn.preprocessing import LabelEncoder

# Patching scipy.linalg before importing gensim
import scipy.linalg    # Import scipy.linalg
from numpy import triu    # Import triu from numpy
scipy.linalg.triu = triu    # Inject triu into scipy.linalg

# To import Word2Vec
from gensim.models import Word2Vec

import sklearn.metrics as metrics
# To tune the model
from sklearn.model_selection import GridSearchCV

# Converting the Stanford GloVe model vector format to word2vec
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Deep Learning library
import torch

# to load transformer models
from sentence_transformers import SentenceTransformer

# To split data into train and test sets
from sklearn.model_selection import train_test_split

# To build a Random Forest model
from sklearn.ensemble import RandomForestClassifier

# To compute metrics to evaluate the model
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score, classification_report
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...

Loading the dataset

In [4]:
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [5]:
# loading the dataset
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/NLP/Project/stock_news.csv')
In [6]:
data = df.copy()

Data Overview

In [7]:
data.head()
Out[7]:
Date News Open High Low Close Volume Label
0 2019-01-02 The tech sector experienced a significant decline in the aftermarket following Apple's Q1 revenue warning. Notable suppliers, including Skyworks, Broadcom, Lumentum, Qorvo, and TSMC, saw their stocks drop in response to Apple's downward revision of its revenue expectations for the quarter, previously announced in January. 41.740002 42.244999 41.482498 40.246914 130672400 -1
1 2019-01-02 Apple lowered its fiscal Q1 revenue guidance to $84 billion from earlier estimates of $89-$93 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple's stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 10 41.740002 42.244999 41.482498 40.246914 130672400 -1
2 2019-01-02 Apple cut its fiscal first quarter revenue forecast from $89-$93 billion to $84 billion due to weaker demand in China and fewer iPhone upgrades. CEO Tim Cook also mentioned constrained sales of Airpods and Macbooks. Apple's shares fell 8.5% in post market trading, while Asian suppliers like Hon 41.740002 42.244999 41.482498 40.246914 130672400 -1
3 2019-01-02 This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple 41.740002 42.244999 41.482498 40.246914 130672400 -1
4 2019-01-02 Apple's revenue warning led to a decline in USD JPY pair and a gain in Japanese yen, as investors sought safety in the highly liquid currency. Apple's underperformance in Q1, with forecasted revenue of $84 billion compared to analyst expectations of $91.5 billion, triggered risk aversion mood in markets 41.740002 42.244999 41.482498 40.246914 130672400 -1
In [8]:
# checking a stock news
data.loc[3, 'News']
Out[8]:
' This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple'
In [9]:
data.shape
Out[9]:
(349, 8)

Observation : There are 349 rows and 8 columns.

In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    349 non-null    object 
 1   News    349 non-null    object 
 2   Open    349 non-null    float64
 3   High    349 non-null    float64
 4   Low     349 non-null    float64
 5   Close   349 non-null    float64
 6   Volume  349 non-null    int64  
 7   Label   349 non-null    int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 21.9+ KB

Observations:

  • There are 6 numerical columns and 2 object columns.
  • We need to convert Date column which is object type to date type
In [11]:
# changing the data type of Date column
data['Date'] = pd.to_datetime(data['Date'])
In [12]:
#verify Date column after converting to datetime type
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    349 non-null    datetime64[ns]
 1   News    349 non-null    object        
 2   Open    349 non-null    float64       
 3   High    349 non-null    float64       
 4   Low     349 non-null    float64       
 5   Close   349 non-null    float64       
 6   Volume  349 non-null    int64         
 7   Label   349 non-null    int64         
dtypes: datetime64[ns](1), float64(4), int64(2), object(1)
memory usage: 21.9+ KB

Observations:

  • We see that the Date column is converted to datetime type.
  • There are 6 numerical columns, 1 datetime column and 1 object column.
In [13]:
data.describe().T
Out[13]:
count mean min 25% 50% 75% max std
Date 349 2019-02-16 16:05:30.085959936 2019-01-02 00:00:00 2019-01-14 00:00:00 2019-02-05 00:00:00 2019-03-22 00:00:00 2019-04-30 00:00:00 NaN
Open 349.0 46.229233 37.567501 41.740002 45.974998 50.7075 66.817497 6.442817
High 349.0 46.700458 37.817501 42.244999 46.025002 50.849998 67.0625 6.507321
Low 349.0 45.745394 37.305 41.482498 45.639999 49.7775 65.862503 6.391976
Close 349.0 44.926317 36.254131 40.246914 44.596924 49.11079 64.805229 6.398338
Volume 349.0 128948236.103152 45448000.0 103272000.0 115627200.0 151125200.0 244439200.0 43170314.918964
Label 349.0 -0.054441 -1.0 -1.0 0.0 0.0 1.0 0.715119

Observations:

  • Open: Mean Open price is 46.22 dollars, min price is 37.56 dollars and max price is 66.81 dollars.
  • High: Mean High price is 46.70 dollars, min price is 37.81 dollars and max price is 67.06 dollars.
  • Low: Mean Low price is 45.74 dollars,min price is 37.30 dollars and max price is 65.86 dollars.
  • Volume: Mean Volume is 128948236, min Volume is 45448000 and max Volume is 244439200.
In [14]:
#check for null values
data.isnull().sum()
Out[14]:
0
Date 0
News 0
Open 0
High 0
Low 0
Close 0
Volume 0
Label 0

Observations:

  • There are no null values.
In [15]:
#check for duplicates
data.duplicated().sum()
Out[15]:
0

Observations:

  • There are no duplicates.

Exploratory Data Analysis

Univariate Analysis

  • Distribution of individual variables
  • Compute and check the distribution of the length of news content

Label

In [16]:
# function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        legend=False,
        hue=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Distribution of Sentiments

In [17]:
labeled_barplot(data, "Label", perc=True)

Observations:

  • Class 0 (Neutral): 48.7% is neutral sentiment and is the majority class
  • Class -1(Negative): 28.4% is negative sentiment and is the second highest class.
  • Class 1(Positive): 22.9% is positive sentiment and is the class with the least count.

    We see that the data is imbalanced.

Open Price

In [18]:
sns.histplot(data['Open'],kde=True)
plt.show()

Observations:

  • Open price is right skewed distribution.
In [19]:
#calculate the mode
data['Open'].mode()
Out[19]:
Open
0 43.57

Observations:

  • Mean Open price(46.22) > Median Open price(45.97) > Mode Open price(43.57)
In [20]:
# Creating a time series DataFrame with 'Date' as index
timeseries_df = data.set_index('Date')
In [21]:
timeseries_df['Month'] = timeseries_df.index.month  # Extract month from the index
sns.boxplot(x='Month', y='Open', data=timeseries_df)  # Box plot of 'Close' prices by month
plt.show()

Observations:

  • There are outliers in Month 1, 3 and 4.
  • Month 2 Open price has the wide range of prices.

High Price

In [22]:
sns.histplot(data['High'],kde=True)
plt.show()

Observations:

  • High stock price is right skewed distribution.
In [23]:
#calculate the mode of High stock price
data['High'].mode()
Out[23]:
High
0 43.787498

Observations:

  • Mean High price(46.70) > Median High price(46.02) > Mode High price(43.78)
In [24]:
timeseries_df['Month'] = timeseries_df.index.month  # Extract month from the index
sns.boxplot(x='Month', y='High', data=timeseries_df)  # Box plot of 'Close' prices by month
plt.show()

Observations:

  • There are outliers in month 1, 3 and 4.
  • Month 2 Open price has the wide range of prices.

Low Price

In [25]:
sns.histplot(data['Low'],kde=True)
plt.show()

Observations:

Low stock price is right skewed distribution.

In [26]:
#calculate the mode of Low stock price
data['Low'].mode()
Out[26]:
Low
0 43.2225

Observations:

  • Mean Low Price(45.74)>Median Low Price(45.63)>Mode Low Price(43.22)
In [27]:
timeseries_df['Month'] = timeseries_df.index.month  # Extract month from the index
sns.boxplot(x='Month', y='Low', data=timeseries_df)  # Box plot of 'Close' prices by month
plt.show()

Observations:

  • There are outliers in month 1, 3 and 4.
  • Month 2 Low price has the wide range of prices.

Close Price

In [28]:
sns.histplot(data['Close'],kde=True)
plt.show()

Observations:

  • Its a right skewed distribution.
In [29]:
#calculate the mode
data['Close'].mode()
Out[29]:
Close
0 42.470604

Observations:

  • Mean Close price(44.92) > Median Close price(44.59) > Mode Close price(42.47)
In [30]:
timeseries_df['Month'] = timeseries_df.index.month  # Extract month from the index
sns.boxplot(x='Month', y='Close', data=timeseries_df)  # Box plot of 'Close' prices by month
plt.show()

Observations:

  • There are outliers in all the months.

Volume

In [31]:
sns.histplot(data['Volume'],kde=True)
plt.show()

Observations:

  • Volume is a right skewed distribution.
In [32]:
sns.boxplot(x=data['Volume'])
plt.show()

Observations:

  • There is outlier.

Compute and check the distribution of the length of news content

In [33]:
#calculate the length of each news article
data['news_length'] = data['News'].apply(len)
In [34]:
#check the newly added column 'news_length'
data.head()
Out[34]:
Date News Open High Low Close Volume Label news_length
0 2019-01-02 The tech sector experienced a significant decline in the aftermarket following Apple's Q1 revenue warning. Notable suppliers, including Skyworks, Broadcom, Lumentum, Qorvo, and TSMC, saw their stocks drop in response to Apple's downward revision of its revenue expectations for the quarter, previously announced in January. 41.740002 42.244999 41.482498 40.246914 130672400 -1 324
1 2019-01-02 Apple lowered its fiscal Q1 revenue guidance to $84 billion from earlier estimates of $89-$93 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple's stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 10 41.740002 42.244999 41.482498 40.246914 130672400 -1 323
2 2019-01-02 Apple cut its fiscal first quarter revenue forecast from $89-$93 billion to $84 billion due to weaker demand in China and fewer iPhone upgrades. CEO Tim Cook also mentioned constrained sales of Airpods and Macbooks. Apple's shares fell 8.5% in post market trading, while Asian suppliers like Hon 41.740002 42.244999 41.482498 40.246914 130672400 -1 296
3 2019-01-02 This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple 41.740002 42.244999 41.482498 40.246914 130672400 -1 300
4 2019-01-02 Apple's revenue warning led to a decline in USD JPY pair and a gain in Japanese yen, as investors sought safety in the highly liquid currency. Apple's underperformance in Q1, with forecasted revenue of $84 billion compared to analyst expectations of $91.5 billion, triggered risk aversion mood in markets 41.740002 42.244999 41.482498 40.246914 130672400 -1 305

Check the distribution

Histogram

In [35]:
sns.histplot(data['news_length'], kde=True)
plt.xlabel('news length')
plt.ylabel('Frequency')
plt.title('Distribution of news length')
plt.show()

Observations:

  • Its a left skewed distribution.
In [36]:
print(data['news_length'].describe())
count    349.000000
mean     311.237822
std       39.079467
min      110.000000
25%      290.000000
50%      315.000000
75%      336.000000
max      394.000000
Name: news_length, dtype: float64

Observations:

  • Mean news length is 311 characters. Min is 110 characters and max is 394 characters.
In [37]:
sns.boxplot(x=data['news_length'])
plt.xlabel('news length')
plt.title('Box Plot of news length')
plt.show()

Observations:

  • There are outliers in news length.

Bivariate Analysis

  • Correlation
  • Sentiment Polarity vs Price
  • Date vs Price

Note: The above points are listed to provide guidance on how to approach bivariate analysis. Analysis has to be done beyond the above listed points to get maximum scores.

In [38]:
sns.pairplot(data=data,hue='Label')
Out[38]:
<seaborn.axisgrid.PairGrid at 0x7d625be1f850>

Observations:

  • We see a positive correlation between
    • Open price and High price
    • Open price and Low price
    • Open price and Close price
    • High Price and Low Price
    • High Price and Close Price
    • Low Price and Close Price
In [39]:
plt.figure(figsize=(15, 7))
numeric_df = data.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Observations:

  • We see the highest positive correlation of 1.0 between the following
    • Open price and High price
    • Open price and Low price
    • Open price and Close price
    • High Price and Low Price
    • High Price and Close Price
    • Low Price and Close Price

Relationship of numerical variables on target variable

  • Sentiment Polarity vs Price
    • Sentiment polarity vs Open
    • Sentiment polarity vs High
    • Sentiment polarity vs Low
    • Sentiment polarity vs Close
  • Sentiment Polarity vs Volume
In [40]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 3, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[0, 2].set_title("Distribution of target for target=" + str(target_uniq[2]))
    sns.histplot(
        data=data[data[target] == target_uniq[2]],
        x=predictor,
        kde=True,
        ax=axs[0, 2],
        color="blue",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    #sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    sns.boxplot(data=data, x=target, y=predictor, hue=target, ax=axs[1, 0], palette="gist_rainbow", legend=False)

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        hue=target,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
        legend=False
    )

    plt.tight_layout()
    plt.show()

Sentimental Polarity(Label) vs Open Price

In [41]:
distribution_plot_wrt_target(data, 'Open', 'Label')

Observations:

  • Distribution is left skewed for negative,neutral and positive sentiments.
  • There are outliers for all the sentiments.
  • Median Open price of neutral and positive is around 47 and negative sentiment is less which is around 43.

Sentimental Polarity(Label) vs High Price

In [42]:
distribution_plot_wrt_target(data, 'High', 'Label')

Observations:

  • Distribution is left skewed for negative,neutral and positive sentiments.
  • There are outliers for all the sentiments.
  • Median High price of neutral and positive is around 47 and negative sentiment is less which is around 43.

Sentimental Polarity(Label) vs Low Price

In [43]:
distribution_plot_wrt_target(data, 'Low', 'Label')

Observations:

  • Distribution is left skewed for negative,neutral and positive sentiments.
  • There are outliers for all the sentiments.
  • Median Open price of neutral and positive is around 47 and negative sentiment is less which is around 43.

Sentimental Polarity(Label) vs Close Price

In [44]:
distribution_plot_wrt_target(data, 'Close', 'Label')

Observations:

  • Distribution is left skewed for negative,neutral and positive sentiments.
  • There are outliers for all the sentiments.
  • Median Close price of neutral and positive is around 45 and negative sentiment is less which is around 42.

Sentimental Polarity(Label) vs Volume

In [45]:
distribution_plot_wrt_target(data, 'Volume', 'Label')

Observations:

  • Distribution is left skewed for negative,neutral and positive sentiments.
  • There are outliers for all the sentiments.
  • Median Volume of neutral sentiment is around 1.20x10^8, negative sentiment is around 1.125x10^8, positive sentiment is around 1.15x10^8.

Price vs Date

Plot Open,High,Low and Close Prices Vs Date

In [46]:
#build timeseries plot
# Create subplots
fig,axes = plt.subplots(4, 1, figsize=(10, 12), sharex=True)  # 4 rows, 1 column

# Plot each price on a separate subplot
sns.lineplot(x=timeseries_df.index, y=timeseries_df['Open'], label='Open', ax=axes[0])
sns.lineplot(x=timeseries_df.index, y=timeseries_df['High'], label='High', ax=axes[1])
sns.lineplot(x=timeseries_df.index, y=timeseries_df['Low'], label='Low', ax=axes[2])
sns.lineplot(x=timeseries_df.index, y=timeseries_df['Close'], label='Close', ax=axes[3])
# Add labels and titles
for ax, price_type in zip(axes, ['Open', 'High', 'Low', 'Close']):
    ax.set_ylabel(price_type + ' Price')
    ax.legend()
    ax.grid(True)

axes[3].set_xlabel('Date')  # X-axis label only on the bottom subplot
fig.suptitle('Time Series Plots of Stock Prices', fontsize=16)  # Overall title

plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels
plt.tight_layout()  # Adjust spacing
plt.show()  # Display the plot

Observations:

  • The graph pattern of Open,High, Low and Close price looks very similar over the period of time.
  • The stock price trend
    • begins to rise up from begining of the month, peaks before the mid of the month, the drops low by mid of the month.It stays low from mid of the month to the beginning of the next month and the cycle continues.
    • The above cycle is true for all stock prices - Open, High, Low and Close price.
  • The stock price range across all the prices are very similar.

Volume vs Date

In [47]:
#plot Volume vs date
sns.lineplot(x='Date', y='Volume', data=timeseries_df)
plt.xlabel('Date')
plt.ylabel('Volume')
plt.title('Time Series Plot of Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()

Observations:

  • Highest volume traded is around 2019-02-1 and lowest volume traded is between 2019-03-01 and 2019-03-15.

Volume of stocks traded every month

In [48]:
#Volume of stocks traded every month
timeseries_df['Month'] = timeseries_df.index.month
monthly_volume = timeseries_df.groupby('Month')['Volume'].sum()
sns.lineplot(x= monthly_volume.index, y=monthly_volume.values)
plt.xlabel('Month')
plt.ylabel('Total Volume')
plt.title('Monthly Volume of Stocks Traded')
plt.show()

Observations:

  • Volume decreases significantly after month 1,it drops to 0.50x10^10 by month 2, it then raises to 1.00x10^10 by month 3 and drops to 0.75x10^10 by month 4

Price, Volume vs date

Plot Price and Volume in y axis and date on x axis to see how the change in price and Volume over a period of time happens and if there is a pattern.

Open Price, Volume vs date

In [49]:
fig, ax1 = plt.subplots(figsize=(10, 6))

ax1.plot(timeseries_df.index, timeseries_df['Open'], color='blue', label='Open Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('Open Price', color='blue')
ax1.tick_params('y', labelcolor='blue')

ax2 = ax1.twinx()  # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')

fig.tight_layout()
plt.title('Time Series Plot of Open Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()

Observations:

  • Changing trend in Volume and open price is not correlated.
  • Open Price and Volume peaks at different times.

High Price, Volume vs date

In [50]:
fig, ax1 = plt.subplots(figsize=(10, 6))

ax1.plot(timeseries_df.index, timeseries_df['High'], color='blue', label='High Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('High Price', color='blue')
ax1.tick_params('y', labelcolor='blue')

ax2 = ax1.twinx()  # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')

fig.tight_layout()
plt.title('Time Series Plot of High Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()

Observations:

  • Changing trend in Volume and High price is not correlated.
  • Open Price and Volume peaks at different times.

Low Price, Volume vs date

In [51]:
fig, ax1 = plt.subplots(figsize=(10, 6))

ax1.plot(timeseries_df.index, timeseries_df['Low'], color='blue', label='Low Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('Low Price', color='blue')
ax1.tick_params('y', labelcolor='blue')

ax2 = ax1.twinx()  # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')

fig.tight_layout()
plt.title('Time Series Plot of Low Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()

Observations:

  • Changing trend in Volume and Low price is not correlated.
  • Open Price and Volume peaks at different times.

Close Price, Volume vs date

In [52]:
fig, ax1 = plt.subplots(figsize=(10, 6))

ax1.plot(timeseries_df.index, timeseries_df['Close'], color='blue', label='Close Price')
ax1.set_xlabel('Date')
ax1.set_ylabel('Close Price', color='blue')
ax1.tick_params('y', labelcolor='blue')

ax2 = ax1.twinx()  # Create a secondary y-axis
ax2.plot(timeseries_df.index, timeseries_df['Volume'], color='red', label='Volume')
ax2.set_ylabel('Volume', color='red')
ax2.tick_params('y', labelcolor='red')

fig.tight_layout()
plt.title('Time Series Plot of Close Price and Volume')
plt.grid(True)
plt.xticks(rotation=45, ha='right')
plt.show()

Observations:

  • Changing trend in Volume and Close price is not correlated.
  • Open Price and Volume peaks at different times.

EDA Summary

Univariate Analysis

  • Class 0 (Neutral): 48.7% is neutral sentiment and is the majority class
  • Class -1(Negative): 28.4% is negative sentiment and is the second highest class.
  • Class 1(Positive): 22.9% is positive sentiment and is the class with the least count.
We see that the data is imbalanced.

Multivariate Analysis

  • Sentimental Polarity(Label) vs Open Price

    • Distribution is left skewed for negative,neutral and positive sentiments.
    • There are outliers for all the sentiments.
    • Median Open price of neutral and positive is around 47 and negative sentiment is less which is around 43.
  • Sentimental Polarity(Label) vs High Price

    • Distribution is left skewed for negative,neutral and positive sentiments.
    • There are outliers for all the sentiments.
    • Median High price of neutral and positive is around 47 and negative sentiment is less which is around 43.
  • Sentimental Polarity(Label) vs Low Price

    • Distribution is left skewed for negative,neutral and positive sentiments.
    • There are outliers for all the sentiments.
    • Median Open price of neutral and positive is around 47 and negative sentiment is less which is around 43.
  • Sentimental Polarity(Label) vs Close Price

    • Distribution is left skewed for negative,neutral and positive sentiments.
    • There are outliers for all the sentiments.
    • Median Close price of neutral and positive is around 45 and negative sentiment is less which is around 42.
  • Sentimental Polarity(Label) vs Volume

    • Distribution is left skewed for negative,neutral and positive sentiments.
    • There are outliers for all the sentiments.
    • Median Volume of neutral sentiment is around 1.20x10^8, negative sentiment is around 1.125x10^8, positive sentiment is around 1.15x10^8.
  • Price vs Date

    • The graph pattern of Open,High, Low and Close price looks very similar over the period of time.
    • The stock price trend
      • begins to rise up from begining of the month, peaks before the mid of the month, the drops low by mid of the month.It stays low from mid of the month to the beginning of the next month and the cycle continues.
      • The above cycle is true for all stock prices - Open, High, Low and Close price.
    • The stock price range across all the prices are very similar.
  • Volume of stocks traded every month

    • Volume decreases significantly after month 1,it drops to 0.50x10^10 by month 2, it then raises to 1.00x10^10 by month 3 and drops to 0.75x10^10 by month 4.

Data Preprocessing

In [53]:
dataset = data.copy()

Preprocessing the textual column

In [54]:
# Loading the Porter Stemmer
ps = PorterStemmer()
In [55]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and numbers
    text = re.sub(r'[^A-Za-z\s]', '', text)

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Split text into separate words
    words = text.split()

    # Removing English language stopwords
    text = ' '.join([word for word in words if word not in stopwords.words('english')])

    # Applying the Porter Stemmer on every word of a message and joining the stemmed words back into a single string
    text = ' '.join([ps.stem(word) for word in words])

    return text
In [56]:
# preprocessing the textual column
dataset['News_clean'] = dataset['News'].apply(preprocess_text)
In [57]:
#display cleaned text
dataset.head()
Out[57]:
Date News Open High Low Close Volume Label news_length News_clean
0 2019-01-02 The tech sector experienced a significant decline in the aftermarket following Apple's Q1 revenue warning. Notable suppliers, including Skyworks, Broadcom, Lumentum, Qorvo, and TSMC, saw their stocks drop in response to Apple's downward revision of its revenue expectations for the quarter, previously announced in January. 41.740002 42.244999 41.482498 40.246914 130672400 -1 324 the tech sector experienc a signific declin in the aftermarket follow appl q revenu warn notabl supplier includ skywork broadcom lumentum qorvo and tsmc saw their stock drop in respons to appl downward revis of it revenu expect for the quarter previous announc in januari
1 2019-01-02 Apple lowered its fiscal Q1 revenue guidance to $84 billion from earlier estimates of $89-$93 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple's stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 10 41.740002 42.244999 41.482498 40.246914 130672400 -1 323 appl lower it fiscal q revenu guidanc to billion from earlier estim of billion due to weaker than expect iphon sale the announc caus a signific drop in appl stock price and neg impact relat supplier lead to broader market declin for tech indic such as nasdaq
2 2019-01-02 Apple cut its fiscal first quarter revenue forecast from $89-$93 billion to $84 billion due to weaker demand in China and fewer iPhone upgrades. CEO Tim Cook also mentioned constrained sales of Airpods and Macbooks. Apple's shares fell 8.5% in post market trading, while Asian suppliers like Hon 41.740002 42.244999 41.482498 40.246914 130672400 -1 296 appl cut it fiscal first quarter revenu forecast from billion to billion due to weaker demand in china and fewer iphon upgrad ceo tim cook also mention constrain sale of airpod and macbook appl share fell in post market trade while asian supplier like hon
3 2019-01-02 This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple 41.740002 42.244999 41.482498 40.246914 130672400 -1 300 thi news articl report that yield on longdat us treasuri secur hit their lowest level in nearli a year on januari due to concern about the health of the global economi follow weak econom data from china and europ as well as the partial us govern shutdown appl
4 2019-01-02 Apple's revenue warning led to a decline in USD JPY pair and a gain in Japanese yen, as investors sought safety in the highly liquid currency. Apple's underperformance in Q1, with forecasted revenue of $84 billion compared to analyst expectations of $91.5 billion, triggered risk aversion mood in markets 41.740002 42.244999 41.482498 40.246914 130672400 -1 305 appl revenu warn led to a declin in usd jpi pair and a gain in japanes yen as investor sought safeti in the highli liquid currenc appl underperform in q with forecast revenu of billion compar to analyst expect of billion trigger risk avers mood in market

Please Note : Moved splitting the dataset after word embeddings

Word Embeddings

Word2Vec

In [58]:
# Creating a list of all words in our data
words_list = [item.split(" ") for item in dataset['News_clean'].values]
In [59]:
# Creating an instance of Word2Vec
vec_size = 300
model_W2V = Word2Vec(words_list, vector_size = vec_size, min_count = 1, window=5, workers = 6)
In [60]:
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(list(model_W2V.wv.key_to_index)))
Length of the vocabulary is 2595

Let's check out a few word embeddings obtained using the model

In [61]:
# Checking the word embedding of a random word
word = "market"
model_W2V.wv[word]
Out[61]:
array([-0.02619236,  0.10335973,  0.01207464,  0.02755376,  0.00298552,
       -0.18407145,  0.12158205,  0.38889915,  0.04796616, -0.03685791,
        0.02456207, -0.1171341 ,  0.02555677, -0.05101335, -0.11536811,
       -0.17199524,  0.05409845, -0.02532996,  0.09496446, -0.07704327,
       -0.04800744, -0.00629101,  0.10416293,  0.00923587,  0.08981215,
        0.02518637, -0.18417688,  0.04103204, -0.08342238, -0.14727788,
        0.00906451, -0.04796242,  0.02504796, -0.02937202, -0.03566457,
        0.0761544 ,  0.07735652, -0.18193936, -0.01028945,  0.01900533,
       -0.09667405,  0.02652544,  0.02958102, -0.11190179,  0.05965571,
        0.13110532,  0.07315437,  0.05403366, -0.03933836,  0.12233776,
        0.03553516,  0.00790851, -0.10131407, -0.00053939, -0.0209395 ,
        0.17489183,  0.08686557,  0.04414649,  0.09752031, -0.00081837,
       -0.09260087, -0.01635911, -0.00280022,  0.01870501, -0.02184183,
        0.10930146, -0.0349811 ,  0.02956992, -0.11956186, -0.06160917,
        0.01062812,  0.10010357,  0.19747746, -0.09236382,  0.02000702,
        0.06065384, -0.15166365,  0.03100675, -0.09374852,  0.13000458,
       -0.10413231, -0.09952228,  0.01727775,  0.34102163,  0.02234886,
        0.04078164, -0.05841886,  0.00548456,  0.14157537,  0.05043319,
        0.1489149 , -0.08882129,  0.08824223, -0.03796206,  0.16127615,
        0.15202564,  0.12484193, -0.03920582, -0.07172115,  0.10503875,
        0.00657613,  0.0499927 ,  0.08457315,  0.03977159,  0.05178807,
       -0.1058948 ,  0.00194093, -0.00505755, -0.1761645 ,  0.0596185 ,
       -0.18169789, -0.07340911,  0.01203382,  0.1231411 ,  0.08654603,
        0.07140123, -0.02032464, -0.02004059,  0.16995844, -0.17504582,
        0.05316987,  0.10536989,  0.08791533,  0.0174938 , -0.0598908 ,
        0.12409091,  0.04788527, -0.15639025,  0.01701211,  0.1423115 ,
        0.10618278,  0.16454045,  0.02994653, -0.18959007,  0.09779719,
        0.09497178, -0.02746745, -0.03337078, -0.16707538, -0.18352774,
        0.02413684, -0.21632655, -0.09280505,  0.1208085 ,  0.12081917,
       -0.07369611, -0.18035388, -0.07769948,  0.11519643, -0.09687529,
        0.00388458, -0.22815196, -0.1178909 , -0.07921895, -0.00449227,
        0.05491902, -0.15742344, -0.0843347 , -0.02245857,  0.16364637,
        0.01531769,  0.13008273, -0.1741396 ,  0.13575824, -0.09871738,
        0.02526765,  0.01886654,  0.0065276 ,  0.0443528 ,  0.28837565,
       -0.07379136,  0.03718238,  0.09082231,  0.03945066, -0.0443172 ,
        0.03815142,  0.02836527, -0.11519879, -0.00564008, -0.06379545,
       -0.0836743 ,  0.03015617, -0.12494174, -0.09863839, -0.10060435,
        0.03220685,  0.18820445,  0.18484013,  0.02374544, -0.18870714,
        0.03525112,  0.03589362, -0.15393758,  0.02940344,  0.06512734,
       -0.12427989,  0.06093823, -0.11765807,  0.08400398, -0.02097673,
       -0.11746108,  0.04997431, -0.02397075, -0.13003474,  0.01751794,
       -0.06986126, -0.01124742,  0.04735342, -0.00452331, -0.06228022,
       -0.01054301, -0.17940722, -0.10368105, -0.07261922,  0.09024487,
       -0.15348522, -0.04659303, -0.28362328, -0.20744316, -0.16721927,
        0.0900349 ,  0.05320412, -0.04013585, -0.09091042, -0.11475689,
       -0.05854544,  0.03884745, -0.02788538, -0.1514323 ,  0.09320636,
        0.10248937, -0.08022905, -0.12560503,  0.08131683, -0.13869728,
        0.0520396 ,  0.02178802,  0.06536731,  0.03266282, -0.26889437,
        0.04702351, -0.05050156, -0.07254007,  0.01203093,  0.03538652,
       -0.15728356, -0.01008521,  0.03628449,  0.03542751,  0.13804068,
        0.04623249,  0.05980511,  0.07657038,  0.01457062, -0.20258264,
       -0.11101552,  0.22740401,  0.0493133 , -0.2480659 , -0.11973819,
        0.13049488,  0.07475292,  0.09948917, -0.17374341, -0.13971074,
        0.03311826,  0.02140839,  0.07404722, -0.12163269,  0.03029039,
       -0.07560678,  0.02237209, -0.00801008, -0.0119422 ,  0.17496045,
        0.05044008,  0.11795008,  0.10790506, -0.16225208,  0.00778078,
        0.09917872,  0.01238052, -0.04234918,  0.06096548,  0.01118987,
        0.02505489, -0.18328585,  0.11144482,  0.02267346,  0.12413804,
        0.06061259,  0.16402134,  0.15492448, -0.00614087,  0.2055838 ,
        0.18517585,  0.03203626, -0.0898002 ,  0.13831933, -0.03188398],
      dtype=float32)
In [62]:
# Checking the word embedding of a random word
word = "stock"
model_W2V.wv[word]
Out[62]:
array([-0.02045205,  0.10463835,  0.01049446,  0.02802964,  0.00188161,
       -0.17688468,  0.11972414,  0.3783331 ,  0.04554815, -0.03532047,
        0.02425865, -0.1178901 ,  0.02426289, -0.0504083 , -0.11628226,
       -0.16344367,  0.05158539, -0.02283507,  0.08945501, -0.07324712,
       -0.0505218 , -0.01066491,  0.10034809,  0.00751636,  0.08666476,
        0.02326753, -0.17598642,  0.03907487, -0.08453679, -0.14542647,
        0.00849416, -0.0464244 ,  0.03064795, -0.03259274, -0.03925902,
        0.07780359,  0.07672428, -0.17327309, -0.01501076,  0.01965822,
       -0.09393512,  0.02784861,  0.03284603, -0.11282644,  0.06239362,
        0.1224938 ,  0.07322551,  0.05161165, -0.03844189,  0.11791036,
        0.03362389,  0.0088425 , -0.10191637, -0.00239066, -0.0185034 ,
        0.16622564,  0.08196869,  0.04112235,  0.0970806 , -0.00468619,
       -0.09072956, -0.01807338, -0.00292329,  0.02017715, -0.01744927,
        0.10350718, -0.03272745,  0.02909189, -0.11540888, -0.0610826 ,
        0.01538841,  0.09567395,  0.19083723, -0.08726043,  0.02259254,
        0.0619636 , -0.14502761,  0.02656754, -0.08611565,  0.12670913,
       -0.10490236, -0.10229662,  0.01603034,  0.3292971 ,  0.02146432,
        0.03508835, -0.05451316,  0.00807703,  0.13910711,  0.04994658,
        0.1408412 , -0.08448265,  0.08841845, -0.03697748,  0.1589433 ,
        0.14586315,  0.12100307, -0.03538205, -0.07100576,  0.10522498,
        0.00516061,  0.04742294,  0.08314201,  0.03655434,  0.04618177,
       -0.10596666,  0.00187082, -0.0048003 , -0.17029592,  0.05929662,
       -0.175495  , -0.07277106,  0.00761318,  0.11644186,  0.08546568,
        0.07147191, -0.01725175, -0.01979659,  0.1643745 , -0.17221013,
        0.04607587,  0.10826748,  0.09097169,  0.01151972, -0.05857029,
        0.12741986,  0.0422319 , -0.1521575 ,  0.01301505,  0.13939339,
        0.10570454,  0.15932116,  0.03447146, -0.18597761,  0.08992431,
        0.09098686, -0.03211248, -0.03698695, -0.1672948 , -0.17652923,
        0.02362618, -0.21079697, -0.09203649,  0.11838674,  0.118288  ,
       -0.07340016, -0.17125826, -0.08138511,  0.10891545, -0.09187808,
        0.00234004, -0.22664328, -0.11477372, -0.07808892, -0.00551528,
        0.05532292, -0.15411516, -0.07829029, -0.01957019,  0.15551125,
        0.01873701,  0.12493234, -0.16943258,  0.13537113, -0.09492893,
        0.02865508,  0.01423172,  0.00220067,  0.04203697,  0.27942678,
       -0.07313751,  0.03603896,  0.09079587,  0.03798223, -0.04257239,
        0.0390996 ,  0.02300265, -0.10837867, -0.00884837, -0.06048789,
       -0.08265808,  0.02956101, -0.12404118, -0.09378724, -0.10015236,
        0.03417347,  0.18431969,  0.17659491,  0.01806804, -0.18959492,
        0.03549553,  0.03222268, -0.14770186,  0.02522885,  0.06802344,
       -0.12128899,  0.06180378, -0.11371268,  0.07716557, -0.02152248,
       -0.11427596,  0.04455805, -0.0259778 , -0.12279144,  0.01478992,
       -0.06499128, -0.01361069,  0.04515454, -0.00533577, -0.06194619,
       -0.00921305, -0.17302366, -0.10289591, -0.07017974,  0.08795715,
       -0.15039644, -0.04882947, -0.27424124, -0.1995655 , -0.1590161 ,
        0.08838901,  0.04615956, -0.03526191, -0.08533111, -0.10990509,
       -0.05674572,  0.04154543, -0.0272767 , -0.1429155 ,  0.09491542,
        0.09897585, -0.07775412, -0.11787853,  0.08265383, -0.13059482,
        0.0510731 ,  0.02353811,  0.06655521,  0.03058239, -0.26058674,
        0.04732045, -0.05237174, -0.07162559,  0.0100246 ,  0.03464274,
       -0.15074444, -0.00680452,  0.0330142 ,  0.03480177,  0.13436426,
        0.04397383,  0.06208503,  0.07498986,  0.01098262, -0.19721322,
       -0.11013929,  0.22319745,  0.04865558, -0.2409784 , -0.11554358,
        0.1280164 ,  0.07089652,  0.09695372, -0.16705124, -0.13710481,
        0.03131097,  0.02416922,  0.07361362, -0.12309311,  0.03480978,
       -0.07485258,  0.01901963, -0.00584464, -0.00996458,  0.16755328,
        0.04770495,  0.11689662,  0.10432884, -0.15498532,  0.01189436,
        0.09870256,  0.00819654, -0.03681112,  0.05984482,  0.01218816,
        0.02290365, -0.17980376,  0.10751224,  0.02204006,  0.12219343,
        0.05466993,  0.15564224,  0.15127826, -0.0013412 ,  0.19889276,
        0.1817902 ,  0.02821191, -0.09358677,  0.13161588, -0.03100482],
      dtype=float32)
In [63]:
# Checking the word embedding of a random word
word = "analyst"
model_W2V.wv[word]
Out[63]:
array([-0.01467016,  0.05792558,  0.00958164,  0.01494685,  0.00087817,
       -0.10193085,  0.06885614,  0.22184016,  0.02858982, -0.02007319,
        0.01562253, -0.0671638 ,  0.01015965, -0.03194374, -0.06489472,
       -0.09824303,  0.0320942 , -0.01624677,  0.0500955 , -0.04532096,
       -0.03032446, -0.00367097,  0.06326325,  0.00513419,  0.05316016,
        0.01258489, -0.10054068,  0.02463312, -0.04508311, -0.08145104,
        0.00983864, -0.02494508,  0.01676262, -0.01459704, -0.02279421,
        0.04286728,  0.04426806, -0.101835  , -0.00460594,  0.00779183,
       -0.05259016,  0.01437992,  0.02111489, -0.06423169,  0.0362953 ,
        0.07403231,  0.04477731,  0.02683544, -0.02113187,  0.06907009,
        0.02312756,  0.00560825, -0.05911102, -0.00085519, -0.01041416,
        0.0971791 ,  0.04575061,  0.02281723,  0.0560443 , -0.00250501,
       -0.0514957 , -0.01036808,  0.00169308,  0.0114444 , -0.01012028,
        0.06170741, -0.02040357,  0.01607895, -0.06929877, -0.03676591,
        0.0086394 ,  0.05740441,  0.11262466, -0.05526635,  0.01444309,
        0.03265646, -0.08378455,  0.01540259, -0.04934547,  0.07248114,
       -0.06139494, -0.05592132,  0.01168767,  0.19624871,  0.01016025,
        0.02444555, -0.03221016,  0.00386918,  0.08102876,  0.031344  ,
        0.08338397, -0.04971393,  0.05333282, -0.02496559,  0.09177975,
        0.08529814,  0.07154905, -0.02106351, -0.04291227,  0.06140613,
        0.00597968,  0.0260494 ,  0.04628672,  0.02329222,  0.02688428,
       -0.06441543,  0.00089695, -0.00429781, -0.10055525,  0.03368742,
       -0.10187161, -0.04499821,  0.00734153,  0.06759898,  0.04761869,
        0.04127934, -0.00840395, -0.01508187,  0.09578724, -0.10077456,
        0.02708978,  0.06345913,  0.05380329,  0.00814818, -0.03203223,
        0.06881856,  0.02643014, -0.09108785,  0.00753296,  0.07814651,
        0.06213457,  0.09083541,  0.02003997, -0.10330734,  0.05534565,
        0.0511078 , -0.01969872, -0.02030237, -0.09400326, -0.10141897,
        0.01513183, -0.12126713, -0.05160775,  0.06750025,  0.06594492,
       -0.03921591, -0.10200129, -0.04618069,  0.06314383, -0.05185737,
       -0.00122579, -0.1315288 , -0.06506275, -0.04377221,  0.00214082,
        0.03539197, -0.08612878, -0.04404276, -0.00928751,  0.08958255,
        0.01127289,  0.07091732, -0.1012926 ,  0.07644235, -0.05658731,
        0.01208207,  0.00954076, -0.00082737,  0.02849064,  0.1602634 ,
       -0.04023264,  0.01818437,  0.05283142,  0.02330061, -0.02312855,
        0.01857226,  0.01788178, -0.06319873, -0.00654047, -0.03507867,
       -0.05190117,  0.01890312, -0.06785048, -0.05360259, -0.06136593,
        0.01877665,  0.10840422,  0.10196368,  0.00959081, -0.10942505,
        0.02401055,  0.01792582, -0.08684658,  0.0173627 ,  0.0379563 ,
       -0.06915018,  0.03746324, -0.0638876 ,  0.04713277, -0.01433117,
       -0.06745148,  0.02762845, -0.01621138, -0.072202  ,  0.01204052,
       -0.03831299, -0.00904568,  0.02983517, -0.00706787, -0.0347115 ,
       -0.0080327 , -0.09775317, -0.05940894, -0.04253476,  0.05125591,
       -0.08254886, -0.02646293, -0.1599408 , -0.11566386, -0.09180358,
        0.04744174,  0.02698367, -0.0190151 , -0.0498678 , -0.06616588,
       -0.03072428,  0.02394861, -0.0123154 , -0.08502781,  0.05346349,
        0.05655401, -0.04365355, -0.06949561,  0.0460778 , -0.07565547,
        0.02643673,  0.01092304,  0.03420207,  0.01540187, -0.15415245,
        0.02498   , -0.02993912, -0.04098215,  0.00629773,  0.01750438,
       -0.08532327, -0.00308806,  0.02091251,  0.0178059 ,  0.08117772,
        0.02644905,  0.03619274,  0.03901035,  0.00652096, -0.11644629,
       -0.06467701,  0.12959386,  0.02819719, -0.13906902, -0.06750243,
        0.07522574,  0.04134918,  0.05833499, -0.09537624, -0.07665998,
        0.01906474,  0.01632199,  0.03962027, -0.06904852,  0.01608727,
       -0.04616297,  0.01520503, -0.00581791, -0.00381878,  0.09575941,
        0.02859314,  0.06525371,  0.06246334, -0.09214754,  0.00879996,
        0.05797821,  0.00366962, -0.02161061,  0.03973207,  0.00730406,
        0.01718109, -0.10291766,  0.06206887,  0.01299445,  0.07081372,
        0.02992508,  0.09363092,  0.09041566, -0.00337366,  0.11981707,
        0.10672121,  0.01415992, -0.05405848,  0.07668816, -0.01740389],
      dtype=float32)
In [64]:
# Retrieving the words present in the Word2Vec model's vocabulary
words = list(model_W2V.wv.key_to_index.keys())

# Retrieving word vectors for all the words present in the model's vocabulary
wvs = model_W2V.wv[words].tolist()

# Creating a dictionary of words and their corresponding vectors
word_vector_dict = dict(zip(words, wvs))
In [65]:
def average_vectorizer_Word2Vec(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Creating a list of words in the sentence that are present in the model vocabulary
    words_in_vocab = [word for word in doc.split() if word in words]

    # adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(word_vector_dict[word])

    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector
In [66]:
# creating a dataframe of the vectorized documents
df_Word2Vec = pd.DataFrame(dataset['News_clean'].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
df_Word2Vec
Out[66]:
Feature 0 Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature 6 Feature 7 Feature 8 Feature 9 ... Feature 290 Feature 291 Feature 292 Feature 293 Feature 294 Feature 295 Feature 296 Feature 297 Feature 298 Feature 299
0 -0.023231 0.104832 0.014443 0.029644 0.003267 -0.187060 0.121618 0.394426 0.045144 -0.039256 ... 0.058476 0.165004 0.160281 -0.002721 0.210241 0.187204 0.030844 -0.094192 0.139083 -0.035070
1 -0.023520 0.106946 0.013911 0.030007 0.003044 -0.189970 0.123621 0.401177 0.045844 -0.039967 ... 0.059370 0.168320 0.163380 -0.002691 0.213749 0.191031 0.031464 -0.095518 0.141980 -0.035939
2 -0.020863 0.094849 0.012337 0.026613 0.003182 -0.168639 0.110041 0.356590 0.041184 -0.035720 ... 0.052980 0.149468 0.145151 -0.002177 0.190340 0.169539 0.027869 -0.084681 0.125854 -0.031749
3 -0.022405 0.103868 0.013058 0.029194 0.003450 -0.184169 0.119611 0.388012 0.045052 -0.038510 ... 0.057815 0.163025 0.158468 -0.002329 0.207447 0.184948 0.029809 -0.092458 0.137094 -0.035243
4 -0.022897 0.103972 0.013479 0.028805 0.003438 -0.185085 0.120638 0.391170 0.045221 -0.038815 ... 0.057878 0.163757 0.159298 -0.002324 0.208850 0.185903 0.030606 -0.092743 0.138221 -0.034964
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
344 -0.015645 0.072553 0.009751 0.020355 0.002419 -0.128479 0.083816 0.270218 0.031870 -0.026731 ... 0.040134 0.113690 0.110101 -0.001840 0.145119 0.128848 0.021204 -0.065025 0.095598 -0.024687
345 -0.016107 0.071856 0.009737 0.020569 0.002668 -0.126501 0.082462 0.268352 0.030791 -0.026652 ... 0.039823 0.112457 0.109965 -0.002014 0.143821 0.126722 0.021422 -0.064380 0.094952 -0.023651
346 -0.018061 0.083265 0.010285 0.023719 0.002729 -0.147200 0.096363 0.310562 0.035942 -0.030593 ... 0.045948 0.130414 0.126228 -0.002038 0.165815 0.147575 0.024195 -0.073604 0.109072 -0.028210
347 -0.021309 0.095980 0.012484 0.027626 0.003232 -0.171044 0.110863 0.360484 0.041748 -0.035786 ... 0.053759 0.151225 0.146544 -0.002474 0.192506 0.171789 0.028149 -0.085957 0.127527 -0.032365
348 -0.022844 0.104147 0.014116 0.029004 0.003281 -0.184846 0.120526 0.390804 0.045331 -0.039281 ... 0.058429 0.164437 0.158798 -0.002245 0.208345 0.186087 0.030491 -0.092539 0.137897 -0.035005

349 rows × 300 columns

Glove

In [67]:
from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = '/content/drive/MyDrive/Colab Notebooks/NLP/Project/glove.6B.100d.txt.word2vec'
glove_model = KeyedVectors.load_word2vec_format(filename, binary=False)
In [68]:
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(glove_model.index_to_key))
Length of the vocabulary is 400000
In [69]:
# Checking the word embedding of a random word
word = "market"
glove_model[word]
Out[69]:
array([ 0.39093  ,  0.23755  ,  0.44855  ,  0.11237  , -0.25996  ,
       -1.2248   , -0.44237  , -0.53491  ,  0.37142  , -0.61981  ,
       -0.27387  , -0.032213 ,  0.082629 , -0.52986  ,  0.13012  ,
        0.21703  , -0.45026  , -0.0048895,  0.34887  , -0.26069  ,
        0.56598  , -0.36219  ,  0.41926  ,  0.23441  , -0.29407  ,
       -0.27044  ,  0.29339  , -0.73905  , -0.75965  ,  0.64661  ,
       -0.038757 ,  0.38495  , -0.32314  ,  0.040322 ,  0.24036  ,
        0.35167  ,  0.47404  ,  0.014959 ,  0.12105  , -1.0398   ,
        0.27639  , -1.3785   , -0.22851  , -0.098074 ,  0.1495   ,
       -0.2815   ,  0.31682  , -0.10208  , -0.08586  , -1.5114   ,
       -0.48255  ,  0.15131  ,  0.0080133,  0.74594  , -0.20163  ,
       -2.5268   , -0.82083  ,  0.1143   ,  2.4665   ,  0.19841  ,
        0.1146   ,  0.10083  , -0.60936  ,  0.76722  ,  0.025978 ,
       -0.036936 ,  0.46744  , -0.77073  ,  0.83992  , -0.032931 ,
       -0.13127  , -0.097367 , -0.42634  , -0.49478  , -0.40796  ,
       -0.67504  , -0.28535  ,  0.12474  , -1.145    , -0.43059  ,
        1.172    ,  0.40749  , -0.83089  ,  0.41675  , -0.83018  ,
       -0.88716  , -0.59827  , -0.56652  , -0.2275   , -0.42398  ,
        0.63385  ,  0.62035  , -0.13429  , -0.49012  , -0.78362  ,
        0.85838  ,  0.60102  , -0.40596  ,  0.77826  ,  1.105    ],
      dtype=float32)
In [70]:
# Checking the word embedding of a random word
word = "stock"
glove_model[word]
Out[70]:
array([ 8.6341e-01,  6.9648e-01,  4.5794e-02, -9.5708e-03, -2.5498e-01,
       -7.4666e-01, -2.2086e-01, -4.4615e-01, -1.0423e-01, -9.9931e-01,
        7.2550e-02,  4.5049e-01, -5.9912e-02, -5.7837e-01, -4.6540e-01,
        4.3429e-02, -5.0570e-01, -1.5442e-01,  9.8250e-01, -8.1571e-02,
        2.6523e-01, -2.3734e-01,  9.7675e-02,  5.8588e-01, -1.2948e-01,
       -6.8956e-01, -1.2811e-01, -5.2265e-02, -6.7719e-01,  3.0190e-02,
        1.8058e-01,  8.6121e-01, -8.3206e-01, -5.6887e-02, -2.9578e-01,
        4.7180e-01,  1.2811e+00, -2.5228e-01,  4.9557e-02, -7.2455e-01,
        6.6758e-01, -1.1091e+00, -2.0493e-01, -5.8669e-01, -2.5375e-03,
        8.2777e-01, -4.9102e-01, -2.6475e-01,  4.3015e-01, -2.0516e+00,
       -3.3208e-01,  5.1845e-02,  5.2646e-01,  8.7452e-01, -9.0237e-01,
       -1.7366e+00, -3.4727e-01,  1.6590e-01,  2.7727e+00,  6.5756e-02,
       -4.0363e-01,  3.8252e-01, -3.0787e-01,  5.9202e-01,  1.3468e-01,
       -3.3851e-01,  3.3646e-01,  2.0950e-01,  8.5905e-01,  5.1865e-01,
       -1.0657e+00, -2.6371e-02, -3.1349e-01,  2.3231e-01, -7.0192e-01,
       -5.5737e-01, -2.3418e-01,  1.3563e-01, -1.0016e+00, -1.4221e-01,
        1.0372e+00,  3.5880e-01, -4.2608e-01, -1.9386e-01, -3.7867e-01,
       -6.9646e-01, -3.9989e-01, -5.7782e-01,  1.0132e-01,  2.0123e-01,
       -3.7153e-01,  5.0837e-01, -3.7758e-01, -2.6205e-01, -9.3676e-01,
        1.0053e+00,  8.4393e-01, -2.4698e-01,  1.7339e-01,  9.4473e-01],
      dtype=float32)
In [71]:
# Checking the word embedding of a random word
word = "analyst"
glove_model[word]
Out[71]:
array([-0.68912 , -0.61502 , -0.075303, -0.90185 , -0.21638 , -1.6388  ,
       -0.35845 , -0.6202  , -0.46533 ,  0.24556 , -0.22849 ,  0.36161 ,
       -0.28079 , -0.25722 , -0.034761, -0.22908 , -0.083752, -0.66219 ,
        0.1797  ,  0.32527 ,  0.026944,  0.047801,  0.33888 ,  0.53238 ,
        0.15211 ,  0.29341 ,  0.068226, -0.23673 , -0.72906 ,  1.0928  ,
       -0.34875 ,  0.35338 , -0.33324 , -0.078432,  0.14354 , -0.4008  ,
        0.64908 ,  1.2761  ,  0.3117  , -0.73472 ,  0.57459 , -0.77899 ,
       -1.4682  , -0.060967, -0.20749 , -0.86404 , -0.33615 , -0.42517 ,
       -1.1664  , -0.83425 ,  0.92064 , -0.80822 ,  0.29501 , -0.082142,
        0.43682 , -2.3248  , -0.97519 ,  0.42744 ,  0.22636 ,  0.9386  ,
        0.31505 , -0.54591 , -0.25594 , -0.23809 , -0.25933 , -0.46299 ,
        0.56394 ,  1.1346  ,  0.78818 ,  1.2429  , -0.82584 ,  0.37498 ,
       -0.67665 , -0.98825 , -0.094105, -1.175   ,  0.029294, -0.26227 ,
       -0.84071 , -0.66487 ,  1.4958  ,  0.12404 , -0.27479 , -0.28074 ,
       -0.26273 , -0.61165 , -0.68797 , -0.40587 ,  0.93194 , -0.62827 ,
        0.759   ,  0.095698,  0.23164 , -0.86261 , -0.34132 ,  0.70849 ,
        0.37374 , -0.22845 , -0.92452 ,  0.95916 ], dtype=float32)
In [72]:
# Retrieving the words present in the GloVe model's vocabulary
glove_words = glove_model.index_to_key

# Creating a dictionary of words and their corresponding vectors
glove_word_vector_dict = dict(zip(glove_model.index_to_key,list(glove_model.vectors)))
In [73]:
vec_size=100
In [74]:
def average_vectorizer_GloVe(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Creating a list of words in the sentence that are present in the model vocabulary
    words_in_vocab = [word for word in doc.split() if word in glove_words]

    # adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(glove_word_vector_dict[word])

    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector
In [75]:
# creating a dataframe of the vectorized documents
df_Glove = pd.DataFrame(dataset['News_clean'].apply(average_vectorizer_GloVe).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
df_Glove
Out[75]:
Feature 0 Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature 6 Feature 7 Feature 8 Feature 9 ... Feature 90 Feature 91 Feature 92 Feature 93 Feature 94 Feature 95 Feature 96 Feature 97 Feature 98 Feature 99
0 -0.025091 0.042967 0.116176 -0.099245 -0.099562 -0.266941 -0.129265 0.080286 -0.090646 0.008517 ... -0.017063 0.156885 -0.173618 0.017792 -0.351731 0.080720 -0.009300 -0.071766 0.400022 0.046329
1 0.075789 0.279267 0.286092 -0.077898 -0.022339 -0.379945 -0.178142 -0.061498 -0.161026 0.088777 ... -0.016551 0.146223 -0.238209 -0.091082 -0.463150 0.093832 0.016530 -0.174671 0.531049 -0.026343
2 0.014897 0.207172 0.331676 -0.114473 0.116600 -0.374662 -0.168155 -0.010315 -0.086171 0.033631 ... 0.120662 0.082600 -0.143998 -0.157486 -0.532710 0.129624 -0.030218 -0.157017 0.557803 -0.121953
3 -0.090954 0.123357 0.444133 -0.051370 0.011666 -0.228597 -0.246173 0.033390 -0.150130 0.002191 ... 0.082228 0.127173 -0.272930 0.134984 -0.438773 0.074060 -0.046727 -0.261468 0.554238 0.067364
4 -0.016286 0.095670 0.158662 0.009404 0.022072 -0.162877 -0.133161 -0.037780 -0.213474 0.109459 ... 0.076108 0.076419 -0.141350 -0.127092 -0.298950 0.181038 0.048829 -0.186550 0.361910 -0.034064
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
344 -0.089744 0.078603 0.355204 -0.269606 0.093859 0.229220 -0.084779 0.248824 -0.119752 0.015531 ... 0.113815 -0.164158 0.054945 0.112785 -0.391273 0.064565 -0.150172 -0.331076 0.552668 0.277021
345 0.153751 0.155163 0.296305 0.042006 0.105729 -0.252609 -0.260229 -0.007662 -0.240935 -0.035625 ... 0.056417 0.102090 -0.138025 0.059017 -0.511696 0.348926 0.062021 -0.030892 0.483785 0.106074
346 0.033072 0.072522 0.241457 -0.146820 -0.050864 -0.095216 -0.124294 0.112551 -0.242520 -0.060998 ... -0.011031 0.144369 -0.168728 0.120561 -0.412585 -0.009892 -0.123141 -0.258367 0.350678 0.055361
347 -0.113620 0.056063 0.216818 -0.095542 0.004862 -0.195284 -0.223138 0.078362 -0.195751 -0.034510 ... 0.025093 0.007117 -0.140764 -0.045426 -0.433007 0.026847 0.035699 -0.281933 0.497955 0.019963
348 0.008440 0.193835 0.308262 -0.132214 0.036220 -0.192029 -0.036471 0.145628 -0.261896 0.041172 ... 0.035777 0.057962 -0.283714 0.201147 -0.489115 0.031632 -0.063183 -0.214501 0.710835 -0.138350

349 rows × 100 columns

Sentence Transformer

In [76]:
# defining the model
sent_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
WARNING:huggingface_hub.file_download:Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
In [77]:
# setting the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
In [78]:
# encoding the dataset
embedding_matrix = sent_model.encode(dataset['News'], device=device, show_progress_bar=True)
In [79]:
# printing the shape of the embedding matrix
embedding_matrix.shape
Out[79]:
(349, 384)
In [80]:
# printing the embedding vector of the first news in the dataset
embedding_matrix[0,:]
Out[80]:
array([-2.02309177e-03, -3.67734618e-02,  7.73542747e-02,  4.67134453e-02,
        3.25521417e-02,  2.10233801e-03,  4.32835072e-02,  3.95345204e-02,
        5.82280159e-02,  8.87509063e-03,  7.09636658e-02,  4.99076769e-02,
        6.46608025e-02, -4.97968402e-03, -1.30518908e-02, -2.98355650e-02,
       -8.91323574e-03, -7.82000944e-02, -2.17109080e-02, -5.24822213e-02,
       -5.14276288e-02, -3.30719948e-02, -3.32051441e-02,  4.18125466e-02,
        7.99547806e-02,  1.54092424e-02, -2.15781108e-02,  5.19438200e-02,
       -4.65799235e-02, -3.71372327e-02, -1.04225501e-01,  9.86079052e-02,
        5.21786362e-02,  3.46578658e-02,  1.48810344e-02, -4.47350834e-03,
        5.70118129e-02, -2.41722800e-02,  2.14049425e-02, -6.52145073e-02,
       -3.30645256e-02,  1.61961466e-02, -6.63142055e-02,  4.39943708e-02,
        3.82152535e-02, -4.86519262e-02,  1.62651259e-02, -4.02665511e-02,
       -3.34573910e-03,  3.20956036e-02, -3.91194131e-03, -1.26830200e-02,
        4.49699126e-02,  3.39313671e-02, -5.17101884e-02,  5.32578491e-02,
       -4.53628749e-02, -2.19655950e-02,  7.14350790e-02,  3.10435575e-02,
        4.33289595e-02, -9.26511288e-02,  1.73019301e-02, -8.11784994e-03,
        6.26880899e-02, -1.42353177e-02, -4.14227284e-02,  5.68226427e-02,
       -8.62419158e-02,  6.02036081e-02,  4.75493595e-02, -7.17870370e-02,
        1.17750159e-02,  4.39625345e-02, -3.44558433e-02,  3.67184468e-02,
        5.21756709e-02, -2.13999804e-02, -5.96953230e-03,  8.72080214e-03,
        3.53134945e-02, -3.95235531e-02, -9.36868563e-02,  1.56855509e-02,
       -3.84228416e-02,  2.47945096e-02,  3.62802707e-02, -1.32978121e-02,
       -4.67082337e-02, -5.69417253e-02,  6.84317946e-02, -2.25544814e-03,
        1.43187251e-02, -1.19596459e-02,  5.61927631e-02, -5.40325716e-02,
        1.84007213e-02, -9.87436175e-02,  1.36832241e-02,  1.76982582e-02,
        4.83816266e-02,  3.39467973e-02,  5.20425662e-02, -3.35446261e-02,
       -4.39126268e-02, -9.73524824e-02,  4.13890071e-02,  2.77672848e-03,
        3.94194983e-02,  1.73664466e-02, -2.37191003e-02,  1.02213910e-02,
       -5.20212278e-02, -4.82434891e-02, -6.77878112e-02,  7.58206993e-02,
       -1.03107551e-02,  6.10160418e-02, -4.92668375e-02, -1.84614584e-02,
       -5.64168356e-02,  6.09037392e-02, -4.93266992e-03,  1.02207698e-02,
        3.61768007e-02,  1.62944682e-02, -1.01151690e-01, -9.37191799e-34,
        1.30963475e-02,  2.90461443e-02, -5.68401217e-02, -1.07233785e-02,
        3.43215726e-02, -5.42572662e-02,  6.12290055e-02,  6.73199119e-03,
       -6.39355630e-02, -3.71847749e-02, -4.51456718e-02,  5.50827235e-02,
       -5.48340790e-02, -9.50716212e-02,  7.80199990e-02, -1.12085953e-01,
       -5.90517446e-02, -8.70309211e-03,  6.69128969e-02, -2.09152456e-02,
       -4.48345207e-02, -6.54891506e-02, -5.31035624e-02,  6.96216971e-02,
        8.78787693e-03,  7.72856874e-04,  1.89880596e-03,  3.55077945e-02,
        1.91084538e-02,  2.86549609e-02, -5.54806665e-02,  6.67084381e-02,
        2.76121050e-02, -1.35227069e-01, -2.72631720e-02, -8.76489375e-03,
       -6.54005632e-02,  2.73344759e-02,  8.32478777e-02, -4.15787883e-02,
       -6.82980865e-02,  6.94950894e-02, -5.72324879e-02,  5.30221441e-04,
       -4.17724531e-03, -3.16040814e-02,  4.05772552e-02, -4.03362736e-02,
       -3.90767269e-02, -3.31442244e-02, -6.78732544e-02,  1.14779584e-02,
        2.73937434e-02, -2.94763036e-02,  5.07603213e-02, -1.22913979e-02,
        4.15568314e-02, -5.26328869e-02, -1.52790230e-02,  6.56477585e-02,
        7.96087608e-02,  7.34060928e-02, -4.55296561e-02,  4.26630005e-02,
       -7.28154033e-02,  8.61650109e-02,  9.41243917e-02,  3.60598452e-02,
       -1.42589882e-01,  1.10695474e-01, -2.01496389e-02, -4.56182845e-02,
       -2.93876193e-02,  6.98730443e-03,  5.36932498e-02, -1.35457972e-02,
       -8.81077126e-02, -2.48124208e-02, -4.63529257e-03, -9.48696584e-02,
        6.09889627e-02, -3.14058922e-02,  3.15861292e-02,  2.20643394e-02,
        3.89253385e-02, -7.49614695e-03,  8.55575204e-02, -4.86009791e-02,
       -1.11082266e-03,  5.33105992e-03, -5.82354777e-02, -4.99342680e-02,
       -2.81568710e-02,  5.94846085e-02, -6.10694010e-03, -7.56998985e-34,
       -4.94262613e-02,  1.74969807e-02, -2.44862027e-02,  2.48289593e-02,
       -8.59439969e-02, -1.83839984e-02,  5.83668426e-02,  6.86293468e-03,
        1.76914558e-02,  6.17085434e-02,  4.79522869e-02,  4.48756665e-02,
       -9.16801468e-02,  1.66510455e-02, -3.22471447e-02, -3.55336117e-03,
        6.30989522e-02, -1.44052163e-01,  4.32584397e-02, -7.65528381e-02,
        6.65068477e-02, -3.25963423e-02, -2.85959281e-02, -2.90052462e-02,
       -5.33847436e-02,  6.87323734e-02, -5.31518739e-03,  7.13094473e-02,
       -1.95373576e-02, -4.22351575e-03, -7.53196552e-02, -7.40312785e-02,
        1.34283397e-02,  6.54502064e-02,  7.09960386e-02,  3.40935471e-03,
       -3.15817483e-02,  1.96606643e-03, -2.51394846e-02, -2.93941684e-02,
        5.40001616e-02,  1.96979027e-02,  2.19472963e-02,  1.68580841e-02,
        2.46497449e-02, -4.93298471e-02,  3.24860141e-02,  1.54560590e-02,
        9.43609327e-02, -1.62996091e-02,  9.13446583e-03,  4.94888909e-02,
        2.44886819e-02,  7.70643353e-02, -6.84276819e-02,  3.09738480e-02,
       -2.02456862e-03,  6.57485947e-02, -9.38693061e-02,  1.92350131e-02,
        4.56148089e-04, -5.13908677e-02,  3.38721611e-02, -3.80237326e-02,
        2.85909194e-02, -2.24880576e-02,  8.34821090e-02,  4.39711437e-02,
        2.28685681e-02, -4.97476384e-02,  1.22772694e-01, -3.04845013e-02,
        1.06068710e-02, -7.15398639e-02, -3.84447649e-02,  5.95832393e-02,
       -9.02276710e-02,  1.52782053e-02, -5.60677424e-02, -1.62651781e-02,
        1.27833366e-01,  8.89350623e-02,  1.75955170e-03,  4.24433276e-02,
       -3.15954164e-02,  7.07846656e-02,  2.89796889e-02,  2.44632624e-02,
       -1.99160930e-02,  5.09030223e-02, -1.44732781e-02, -7.35410824e-02,
        3.10671516e-02,  7.79150575e-02, -1.30469546e-01, -3.58579904e-08,
       -1.97356082e-02, -7.96370208e-02,  2.42431052e-02, -1.87578313e-02,
        6.77264333e-02, -2.62168553e-02,  1.00161172e-02,  4.15349901e-02,
        1.08690009e-01,  3.81878614e-02, -7.81213790e-02, -1.09490193e-02,
       -1.17989495e-01,  7.97277093e-02,  2.46654470e-02,  9.32541559e-04,
       -7.25726411e-02,  8.93294439e-03,  4.05105203e-02, -1.03700541e-01,
       -6.20271219e-03,  3.49568725e-02,  7.82293826e-02, -2.42466480e-02,
       -3.33973765e-02,  3.95032354e-02,  1.34978374e-03, -7.48429149e-02,
        5.03437966e-02,  5.74476225e-03, -2.39702500e-02,  1.32120331e-03,
        6.57341704e-02, -5.62655404e-02, -6.76572248e-02, -4.22492959e-02,
       -2.64710262e-02,  2.83033270e-02,  4.93411720e-02,  3.14059295e-02,
       -3.22605111e-02, -1.48835173e-02, -7.07752630e-02, -9.63549968e-03,
       -3.38621400e-02, -1.05421506e-02, -5.47045022e-02,  2.66298186e-02,
        5.15929386e-02, -1.93036906e-02,  1.74268503e-02, -2.24287342e-02,
        5.11595339e-04,  2.79130526e-02, -6.66390806e-02, -5.92727102e-02,
        3.25669954e-03, -5.93556964e-04, -5.50960600e-02, -2.71336380e-02,
       -3.57509800e-03, -1.31502151e-01,  7.41634369e-02,  5.75100332e-02],
      dtype=float32)

Data Pre-processing

Splitting the dataset

In [81]:
# Creating dependent and independent variables
X_word2vec = df_Word2Vec.copy()
X_glove = df_Glove.copy()
X_sent_transformer = embedding_matrix.copy()
y=dataset['Label']
In [82]:
def split(X,y):
    # Initial split into training (80%) and testing (20%)
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)

    # Further split the temporary set into validation (10%) and test (10%) sets
    X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

    return X_train,X_valid,X_test,y_train,y_valid,y_test
In [83]:
#Splitting the dataset.
X_train_word2vec,X_valid_word2vec,X_test_word2vec,y_train_word2vec,y_valid_word2vec,y_test_word2vec=split(X_word2vec,y)
X_train_glove,X_valid_glove,X_test_glove,y_train_glove,y_valid_glove,y_test_glove=split(X_glove,y)
X_train_sent_transformer,X_valid_sent_transformer,X_test_sent_transformer,y_train_sent_transformer,y_valid_sent_transformer,y_test_sent_transformer=split(X_sent_transformer,y)

Check the shape of training,validation and testing datasets of Word2vec,Glove and Sentence Transformers

In [84]:
print(X_train_word2vec.shape, X_test_word2vec.shape, X_valid_word2vec.shape)
(279, 300) (35, 300) (35, 300)
In [85]:
print(y_train_word2vec.shape, y_test_word2vec.shape, y_valid_word2vec.shape)
(279,) (35,) (35,)
In [86]:
print(X_train_glove.shape, X_test_glove.shape, X_valid_glove.shape)
(279, 100) (35, 100) (35, 100)
In [87]:
print(y_train_glove.shape, y_test_glove.shape, y_valid_glove.shape)
(279,) (35,) (35,)
In [88]:
print(X_train_sent_transformer.shape, X_test_sent_transformer.shape, X_valid_sent_transformer.shape)
(279, 384) (35, 384) (35, 384)
In [89]:
print(y_train_sent_transformer.shape, y_test_sent_transformer.shape, y_valid_sent_transformer.shape)
(279,) (35,) (35,)

Observations:

  • Word2vec datasets
    • Training: 279 rows and each sample is represented by 300 dimensional vector(word2vec embeddings)
    • Validation: validation data has 35 samples, with each sample having a 300-dimensional Word2Vec embedding
    • Testing: Testing data has 35 samples, with each sample having a 300-dimensional Word2Vec embedding
  • Glove datasets
    • Training: 279 rows and each sample is represented by 100 dimensional vector(Glove embeddings)
    • Validation: validation data has 35 samples, with each sample having a 100-dimensional Word2Vec embedding
    • Testing: Testing data has 35 samples, with each sample having a 100-dimensional Word2Vec embedding
  • Sentence Transformer datasets
    • Training: 279 rows and each sample is represented by 300 dimensional vector(Sentence transformer embeddings)
    • Validation: validation data has 35 samples, with each sample having a 300-dimensional Word2Vec embedding
    • Testing: Testing data has 35 samples, with each sample having a 300-dimensional Word2Vec embedding

Sentiment Analysis

Model Evaluation Criterion

Model can make wrong prediction:

  • Model predicting a possitive sentiment when the actual outcome is negative - FALSE POSITIVES
  • Model predicting a negative sentiment when the actual outcome is positive - FALSE NEGATIVE

Which case is more important?

  • Predicting a negative sentiment when the actual outcome is positive could result in missing out on opportunities to buy stocks that would have gained value.
  • There is data imbalance as well in which Recall becomes more critical.

How to reduce this loss?

  • Maximize Recall to reduce False Negatives.

Model Building

Function for confusion matrix

In [90]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(cm.shape[0], cm.shape[1])

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Random Forest Model (default with Word2Vec)

In [91]:
# Building the model
rf_word2vec = RandomForestClassifier(random_state = 42)

# Fitting on train data
rf_word2vec.fit(X_train_word2vec, y_train_word2vec)
Out[91]:
RandomForestClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [92]:
confusion_matrix_sklearn(rf_word2vec, X_train_word2vec, y_train_word2vec)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [93]:
confusion_matrix_sklearn(rf_word2vec, X_valid_word2vec, y_valid_word2vec)

Observations:

  • Around 40% of the predicted values match with the actual values.
In [94]:
# Predicting on train data
y_pred_train_word2vec = rf_word2vec.predict(X_train_word2vec)

# Predicting on validation data
y_pred_valid_word2vec = rf_word2vec.predict(X_valid_word2vec)

Classification Report

In [95]:
print(classification_report(y_train_word2vec, y_pred_train_word2vec))
              precision    recall  f1-score   support

          -1       1.00      1.00      1.00        79
           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [96]:
default_word2vec_report = classification_report(y_valid_word2vec, y_pred_valid_word2vec)
print(default_word2vec_report)
              precision    recall  f1-score   support

          -1       0.20      0.10      0.13        10
           0       0.43      0.71      0.53        17
           1       0.50      0.12      0.20         8

    accuracy                           0.40        35
   macro avg       0.38      0.31      0.29        35
weighted avg       0.38      0.40      0.34        35

Observations:

  • We got a weighted recall score of 40% which indicates that the model was overfitting.
    • Negative sentiment recall score is 10%
    • Neutral sentiment recall score is 71%
    • Positive sentiment recall score is 12%

Random Forest Model (default with GloVe)

In [97]:
# Building the model
rf_glove = RandomForestClassifier(random_state = 42)

# Fitting on train data
rf_glove.fit(X_train_glove, y_train_glove)
Out[97]:
RandomForestClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [98]:
confusion_matrix_sklearn(rf_glove, X_train_glove, y_train_glove)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [99]:
confusion_matrix_sklearn(rf_glove, X_valid_glove, y_valid_glove)

Observations:

  • Around 46% of the predicted values match the true values.
In [100]:
# Predicting on train data
y_pred_train_glove = rf_glove.predict(X_train_glove)

# Predicting on validation data
y_pred_valid_glove = rf_glove.predict(X_valid_glove)

Classification Report

In [101]:
print(classification_report(y_train_glove, y_pred_train_glove))
              precision    recall  f1-score   support

          -1       1.00      1.00      1.00        79
           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [102]:
default_glove_report = classification_report(y_valid_glove, y_pred_valid_glove)
print(default_glove_report)
              precision    recall  f1-score   support

          -1       0.33      0.20      0.25        10
           0       0.48      0.76      0.59        17
           1       0.50      0.12      0.20         8

    accuracy                           0.46        35
   macro avg       0.44      0.36      0.35        35
weighted avg       0.44      0.46      0.40        35

Observations:

  • We got a weighted recall score of 46% which indicates that the model was overfitting.
    • Negative sentiment recall score is 20%
    • Neutral sentiment recall score is 76%
    • Positive sentiment recall score is 12%

Random Forest Model (default with Sentence Transformer)

In [103]:
rf_sent_transformer = RandomForestClassifier(random_state = 42)

# Fitting on train data
rf_sent_transformer.fit(X_train_sent_transformer, y_train_sent_transformer)
Out[103]:
RandomForestClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [104]:
confusion_matrix_sklearn(rf_sent_transformer, X_train_sent_transformer, y_train_sent_transformer)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [105]:
confusion_matrix_sklearn(rf_sent_transformer, X_valid_sent_transformer, y_valid_sent_transformer)

Observations:

  • For Validation dataset, around 43% of the predicted values match the true values.
In [106]:
# Predicting on train data
y_pred_train_sent_transformer = rf_sent_transformer.predict(X_train_sent_transformer)

# Predicting on validation data
y_pred_valid_sent_transformer = rf_sent_transformer.predict(X_valid_sent_transformer)

Classification Report

In [107]:
print(classification_report(y_train_sent_transformer, y_pred_train_sent_transformer))
              precision    recall  f1-score   support

          -1       1.00      1.00      1.00        79
           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [108]:
#included zero_division=1 to address the warning related to precision values
default_sent_report = classification_report(y_valid_sent_transformer, y_pred_valid_sent_transformer, zero_division=1)
print(default_sent_report)
              precision    recall  f1-score   support

          -1       0.25      0.10      0.14        10
           0       0.45      0.76      0.57        17
           1       0.50      0.12      0.20         8

    accuracy                           0.43        35
   macro avg       0.40      0.33      0.30        35
weighted avg       0.40      0.43      0.36        35

Observations:

  • We got a weighted recall score of 43% which indicates that the model was overfitting.
    • Negative sentiment recall score is 10%
    • Neutral sentiment recall score is 76%
    • Positive sentiment recall score is 12%

We'll try to address the class imbalance problem now with Class weights.

Random Forest (with class_weights and Word2Vec)

In [109]:
rf_word2vec_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_word2vec_balanced.fit(X_train_word2vec, y_train_word2vec)
Out[109]:
RandomForestClassifier(class_weight='balanced', random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [110]:
confusion_matrix_sklearn(rf_word2vec_balanced, X_train_word2vec, y_train_word2vec)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [111]:
confusion_matrix_sklearn(rf_word2vec_balanced, X_valid_word2vec, y_valid_word2vec)

Observations:

  • For Validation dataset, around 43% of the predicted values match the true values.
In [112]:
# Predicting on train data
y_pred_train_word2vec_balanced = rf_word2vec_balanced.predict(X_train_word2vec)

# Predicting on test data
y_pred_valid_word2vec_balanced = rf_word2vec_balanced.predict(X_valid_word2vec)

Classification report

In [113]:
print(classification_report(y_train_word2vec, y_pred_train_word2vec_balanced))
              precision    recall  f1-score   support

          -1       1.00      1.00      1.00        79
           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [114]:
weighted_word2vec_report = classification_report(y_valid_word2vec, y_pred_valid_word2vec_balanced)
print(weighted_word2vec_report)
              precision    recall  f1-score   support

          -1       0.25      0.10      0.14        10
           0       0.45      0.76      0.57        17
           1       0.50      0.12      0.20         8

    accuracy                           0.43        35
   macro avg       0.40      0.33      0.30        35
weighted avg       0.40      0.43      0.36        35

Observations:

  • We got a weighted recall score of 43% which indicates that the model was overfitting.
    • Negative sentiment recall score is 10%
    • Neutral sentiment recall score is 76%
    • Positive sentiment recall score is 12%

Random Forest (with class_weights and GloVe)

In [115]:
rf_glove_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_glove_balanced.fit(X_train_glove, y_train_glove)
Out[115]:
RandomForestClassifier(class_weight='balanced', random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [116]:
confusion_matrix_sklearn(rf_glove_balanced, X_train_glove, y_train_glove)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [117]:
confusion_matrix_sklearn(rf_glove_balanced, X_valid_glove, y_valid_glove)

Observations,

  • For validation dataset, we see that around 40% of the predicted values match with the true values.
In [118]:
# Predicting on train data
y_pred_train_glove_balanced = rf_glove_balanced.predict(X_train_glove)

# Predicting on test data
y_pred_valid_glove_balanced = rf_glove_balanced.predict(X_valid_glove)

Classification report

In [119]:
print(classification_report(y_train_glove, y_pred_train_glove_balanced))
              precision    recall  f1-score   support

          -1       1.00      1.00      1.00        79
           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [120]:
weighted_glove_report = classification_report(y_valid_glove, y_pred_valid_glove_balanced, zero_division=1)
print(weighted_glove_report)
              precision    recall  f1-score   support

          -1       0.14      0.10      0.12        10
           0       0.46      0.76      0.58        17
           1       1.00      0.00      0.00         8

    accuracy                           0.40        35
   macro avg       0.54      0.29      0.23        35
weighted avg       0.49      0.40      0.31        35

Observations:

  • We got a weighted recall score of 40% which indicates that the model was overfitting.
    • Negative sentiment recall score is 10%
    • Neutral sentiment recall score is 76%
    • Positive sentiment recall score is 0%

Random Forest (with class_weights and Sentence Transformer)

In [121]:
rf_sent_transformer_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_sent_transformer_balanced.fit(X_train_sent_transformer, y_train_sent_transformer)
Out[121]:
RandomForestClassifier(class_weight='balanced', random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [122]:
confusion_matrix_sklearn(rf_sent_transformer_balanced, X_train_sent_transformer, y_train_sent_transformer)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [123]:
confusion_matrix_sklearn(rf_sent_transformer_balanced, X_valid_sent_transformer, y_valid_sent_transformer)

Observations:

  • For validation dataset, we see around 46% of the predicted values match the actual values.
In [124]:
#predicting on train data
y_pred_train_sent_transformer_balanced = rf_sent_transformer_balanced.predict(X_train_sent_transformer)

#predicting on test data
y_pred_valid_sent_transformer_balanced = rf_sent_transformer_balanced.predict(X_valid_sent_transformer)

Classification Report

In [125]:
print(classification_report(y_train_sent_transformer, y_pred_train_sent_transformer_balanced))
              precision    recall  f1-score   support

          -1       1.00      1.00      1.00        79
           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [126]:
weighted_sent_report = classification_report(y_valid_sent_transformer, y_pred_valid_sent_transformer_balanced, zero_division=1)
print(weighted_sent_report)
              precision    recall  f1-score   support

          -1       0.25      0.10      0.14        10
           0       0.47      0.82      0.60        17
           1       1.00      0.12      0.22         8

    accuracy                           0.46        35
   macro avg       0.57      0.35      0.32        35
weighted avg       0.53      0.46      0.38        35

Observations:

  • We got a weighted recall score of 46% which indicates that the model was overfitting.
    • Negative sentiment recall score is 10%
    • Neutral sentiment recall score is 82%
    • Positive sentiment recall score is 12%

Random Forest (with hyperparameter tuning and Word2Vec)

In [127]:
import sklearn.metrics as metrics
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier
rf_tuned = RandomForestClassifier(class_weight="balanced", random_state=42)

# defining the hyperparameter grid for tuning
parameters = {
    "max_depth": list(np.arange(4, 15, 2)),
    "max_features": ["sqrt", 0.5, 0.7],
    "min_samples_split": [5, 6, 7],
    "n_estimators": np.arange(30, 110, 10),
}

# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')

# running the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_word2vec, y_train_word2vec)
In [128]:
# Creating a new model with the best combination of parameters
rf_word2vec_tuned = grid_obj.best_estimator_

# Fit the new model to the data
rf_word2vec_tuned.fit(X_train_word2vec, y_train_word2vec)
Out[128]:
RandomForestClassifier(class_weight='balanced', max_depth=10, max_features=0.5,
                       min_samples_split=6, n_estimators=30, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [129]:
confusion_matrix_sklearn(rf_word2vec_tuned, X_train_word2vec, y_train_word2vec)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [130]:
confusion_matrix_sklearn(rf_word2vec_tuned, X_valid_word2vec, y_valid_word2vec)

Observations:

  • For validation dataset, around 40% of the predicted values match the actual values.
In [131]:
# Predicting on train data
y_pred_train_word2vec_tuned = rf_word2vec_tuned.predict(X_train_word2vec)

# Predicting on validation data
y_pred_valid_word2vec_tuned = rf_word2vec_tuned.predict(X_valid_word2vec)

Classification Report

In [132]:
print(classification_report(y_train_word2vec, y_pred_train_word2vec_tuned))
              precision    recall  f1-score   support

          -1       1.00      0.99      0.99        79
           0       0.99      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 99% which still indicates that the model is overfitting.
In [133]:
tuned_word2vec_report = classification_report(y_valid_word2vec, y_pred_valid_word2vec_tuned)
print(tuned_word2vec_report)
              precision    recall  f1-score   support

          -1       0.14      0.10      0.12        10
           0       0.44      0.65      0.52        17
           1       0.67      0.25      0.36         8

    accuracy                           0.40        35
   macro avg       0.42      0.33      0.34        35
weighted avg       0.41      0.40      0.37        35

Observations:

  • We got a weighted recall score of 40% which indicates that the model was overfitting.
    • Negative sentiment recall score is 10%
    • Neutral sentiment recall score is 65%
    • Positive sentiment recall score is 25%

Random Forest (with hyperparametertuning and GloVe)

In [134]:
# Choose the type of classifier
rf_tuned = RandomForestClassifier(class_weight="balanced", random_state=42)

# defining the hyperparameter grid for tuning
parameters = {
    "max_depth": list(np.arange(4, 15, 2)),
    "max_features": ["sqrt", 0.5, 0.7],
    "min_samples_split": [5, 6, 7],
    "n_estimators": np.arange(30, 110, 10),
}

# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')

# running the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_glove, y_train_glove)
In [135]:
# Creating a new model with the best combination of parameters
rf_glove_tuned = grid_obj.best_estimator_

# Fit the new model to the data
rf_glove_tuned.fit(X_train_glove, y_train_glove)
Out[135]:
RandomForestClassifier(class_weight='balanced', max_depth=14,
                       min_samples_split=5, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [136]:
#Printing the confusion matrix
confusion_matrix_sklearn(rf_glove_tuned, X_train_glove, y_train_glove)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [137]:
#Printing the confusion matrix
confusion_matrix_sklearn(rf_glove_tuned, X_valid_glove, y_valid_glove)

Observations:

  • For validation dataset, we see that around 37% of the predicted values match the actual values.
In [138]:
# Predicting on train data
y_pred_train_glove_tuned = rf_glove_tuned.predict(X_train_glove)

# Predicting on validation data
y_pred_valid_glove_tuned = rf_glove_tuned.predict(X_valid_glove)

Classification Report

In [139]:
print(classification_report(y_train_glove, y_pred_train_glove_tuned))
              precision    recall  f1-score   support

          -1       1.00      1.00      1.00        79
           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [140]:
tuned_glove_report = classification_report(y_valid_glove, y_pred_valid_glove_tuned)
print(tuned_glove_report)
              precision    recall  f1-score   support

          -1       0.17      0.10      0.12        10
           0       0.41      0.65      0.50        17
           1       0.50      0.12      0.20         8

    accuracy                           0.37        35
   macro avg       0.36      0.29      0.27        35
weighted avg       0.36      0.37      0.32        35

Observations:

  • We got a weighted recall score of 37% which indicates that the model was overfitting.
    • Negative sentiment recall score is 10%
    • Neutral sentiment recall score is 65%
    • Positive sentiment recall score is 12%

Random Forest (with hyperparameter tuning and Sentence transformer)

In [141]:
# Choose the type of classifier
rf_tuned = RandomForestClassifier(class_weight="balanced", random_state=42)

# defining the hyperparameter grid for tuning
parameters = {
    "max_depth": list(np.arange(4, 15, 2)),
    "max_features": ["sqrt", 0.5, 0.7],
    "min_samples_split": [5, 6, 7],
    "n_estimators": np.arange(30, 110, 10),
}

# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')

# running the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_sent_transformer, y_train_sent_transformer)
In [142]:
# Creating a new model with the best combination of parameters
rf_sent_tuned = grid_obj.best_estimator_

# Fit the new model to the data
rf_sent_tuned.fit(X_train_sent_transformer, y_train_sent_transformer)
Out[142]:
RandomForestClassifier(class_weight='balanced', max_depth=4,
                       min_samples_split=5, n_estimators=80, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [143]:
#Printing the confusion matrix
confusion_matrix_sklearn(rf_sent_tuned, X_train_sent_transformer, y_train_sent_transformer)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [144]:
#Printing the confusion matrix
confusion_matrix_sklearn(rf_sent_tuned, X_valid_sent_transformer, y_valid_sent_transformer)

Observations:

  • For validation dataset, we see that around 43% of the predicted values match the actual values.
In [145]:
# Predicting on train data
y_pred_train_sent_tuned = rf_sent_tuned.predict(X_train_sent_transformer)

# Predicting on validation data
y_pred_valid_sent_tuned = rf_sent_tuned.predict(X_valid_sent_transformer)

Classification Report

In [146]:
print(classification_report(y_train_sent_transformer, y_pred_train_sent_tuned))
              precision    recall  f1-score   support

          -1       0.95      0.97      0.96        79
           0       0.98      0.96      0.97       136
           1       0.97      0.98      0.98        64

    accuracy                           0.97       279
   macro avg       0.97      0.97      0.97       279
weighted avg       0.97      0.97      0.97       279

Observations:

  • For training data, We have a Recall score of 97%.
In [147]:
tuned_sent_report = classification_report(y_valid_sent_transformer, y_pred_valid_sent_tuned)
print(tuned_sent_report)
              precision    recall  f1-score   support

          -1       0.25      0.20      0.22        10
           0       0.46      0.65      0.54        17
           1       0.67      0.25      0.36         8

    accuracy                           0.43        35
   macro avg       0.46      0.37      0.37        35
weighted avg       0.45      0.43      0.41        35

Observations:

  • We got a weighted recall score of 43% which indicates that the model was overfitting.
    • Negative sentiment recall score is 20%
    • Neutral sentiment recall score is 65%
    • Positive sentiment recall score is 25%

XGBoost (default and Word2Vec)

In [148]:
import pandas as pd
from xgboost import XGBClassifier

#XGBoost doesnt take negative values/classes in y. So map the classes to positive number
# Assuming y_train_word2vec is a pandas Series
y_train_xgb_word2vec = y_train_word2vec.map({-1: 0, 0: 1, 1: 2})  # Change -1 to 0, 0 to 1, and 1 to 2

#Fitting the model
xgb_word2vec = XGBClassifier(random_state=42, eval_metric='logloss')
xgb_word2vec.fit(X_train_word2vec, y_train_xgb_word2vec)

y_valid_xgb_word2vec = y_valid_word2vec.map({-1: 0, 0: 1, 1: 2})  # Change -1 to 0, 0 to 1, and 1 to 2

Confusion Matrix

In [149]:
confusion_matrix_sklearn(xgb_word2vec, X_train_word2vec, y_train_xgb_word2vec)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [150]:
confusion_matrix_sklearn(xgb_word2vec, X_valid_word2vec, y_valid_xgb_word2vec)

Observations:

  • For Validation dataset, we see that around 34% of the predicted values match the actual values.
In [151]:
# Predicting on train data
y_pred_train_xgbword2vec = xgb_word2vec.predict(X_train_word2vec)

# Predicting on validation data
y_pred_valid_xgbword2vec = xgb_word2vec.predict(X_valid_word2vec)

Classification Report

In [152]:
print(classification_report(y_train_xgb_word2vec, y_pred_train_xgbword2vec))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        79
           1       1.00      1.00      1.00       136
           2       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [153]:
default_xgb_word2vec_report = classification_report(y_valid_xgb_word2vec, y_pred_valid_xgbword2vec)
print(default_xgb_word2vec_report)
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.41      0.65      0.50        17
           2       0.00      0.00      0.00         8

    accuracy                           0.31        35
   macro avg       0.14      0.22      0.17        35
weighted avg       0.20      0.31      0.24        35

Observations:

  • We got a weighted recall score of 34% which indicates that the model was overfitting.
    • Negative sentiment recall score is 0%
    • Neutral sentiment recall score is 65%
    • Positive sentiment recall score is 0%

XGBoost Model (default with GloVe)

In [154]:
#XGBoost doesnt take negative values/classes in y. So map the classes to positive number
y_train_xgb_glove = y_train_glove.map({-1: 0, 0: 1, 1: 2})  # Change -1 to 0, 0 to 1, and 1 to 2
y_valid_xgb_glove = y_valid_glove.map({-1: 0, 0: 1, 1: 2})  # Change -1 to 0, 0 to 1, and 1 to 2
In [155]:
# Building the model
xgb_glove = XGBClassifier(random_state = 42)

# Fitting on train data
xgb_glove.fit(X_train_glove, y_train_xgb_glove)
Out[155]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, objective='multi:softprob', ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [156]:
confusion_matrix_sklearn(xgb_glove, X_train_glove, y_train_xgb_glove)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [157]:
confusion_matrix_sklearn(xgb_glove, X_valid_glove, y_valid_xgb_glove)

Observations:

  • For Validation dataset, we see that around 40% of the predicted values match the actual values.
In [158]:
# Predicting on train data
y_pred_train_xgb_glove = xgb_glove.predict(X_train_glove)

# Predicting on validation data
y_pred_valid_xgb_glove = xgb_glove.predict(X_valid_glove)

Classification Report

In [159]:
print(classification_report(y_train_xgb_glove, y_pred_train_xgb_glove))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        79
           1       1.00      1.00      1.00       136
           2       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [160]:
default_xgb_glove_report = classification_report(y_valid_xgb_glove, y_pred_valid_xgb_glove)
print(default_xgb_glove_report)
              precision    recall  f1-score   support

           0       0.22      0.20      0.21        10
           1       0.46      0.65      0.54        17
           2       0.50      0.12      0.20         8

    accuracy                           0.40        35
   macro avg       0.39      0.32      0.32        35
weighted avg       0.40      0.40      0.37        35

Observations:

  • We got a weighted recall score of 40% which indicates that the model was overfitting.
    • Negative sentiment recall score is 20%
    • Neutral sentiment recall score is 65%
    • Positive sentiment recall score is 12%

XGBoost Model (default with Sentence Transformer)

In [161]:
#XGBoost doesnt take negative values/classes in y. So map the classes to positive number
y_train_xgb_sent_transformer = y_train_sent_transformer.map({-1: 0, 0: 1, 1: 2})  # Change -1 to 0, 0 to 1, and 1 to 2
y_valid_xgb_sent_transformer = y_valid_sent_transformer.map({-1: 0, 0: 1, 1: 2})  # Change -1 to 0, 0 to 1, and 1 to 2
In [162]:
xgb_sent_transformer = XGBClassifier(random_state = 42)

# Fitting on train data
xgb_sent_transformer.fit(X_train_sent_transformer, y_train_xgb_sent_transformer)
Out[162]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, objective='multi:softprob', ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [163]:
confusion_matrix_sklearn(xgb_sent_transformer, X_train_sent_transformer, y_train_xgb_sent_transformer)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [164]:
confusion_matrix_sklearn(xgb_sent_transformer, X_valid_sent_transformer, y_valid_xgb_sent_transformer)

Observations:

  • For validation dataset, we see that around 46% of the predicted values match the actual values.
In [165]:
# Predicting on train data
y_pred_train_xgb_sent_transformer = xgb_sent_transformer.predict(X_train_sent_transformer)

# Predicting on validation data
y_pred_valid_xgb_sent_transformer = xgb_sent_transformer.predict(X_valid_sent_transformer)

Classification Report

In [166]:
print(classification_report(y_train_xgb_sent_transformer, y_pred_train_xgb_sent_transformer))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        79
           1       1.00      1.00      1.00       136
           2       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [167]:
#included zero_division=1 to address the warning related to precision values
default_xgb_sent_report = classification_report(y_valid_xgb_sent_transformer, y_pred_valid_xgb_sent_transformer, zero_division=1)
print(default_xgb_sent_report)
              precision    recall  f1-score   support

           0       0.43      0.30      0.35        10
           1       0.45      0.59      0.51        17
           2       0.50      0.38      0.43         8

    accuracy                           0.46        35
   macro avg       0.46      0.42      0.43        35
weighted avg       0.46      0.46      0.45        35

Observations:

  • We got a weighted recall score of 46% which indicates that the model was overfitting.
    • Negative sentiment recall score is 30%
    • Neutral sentiment recall score is 59%
    • Positive sentiment recall score is 38%

XGBoost (with hyperparameter tuning and Word2Vec)

In [168]:
import sklearn.metrics as metrics
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier
xgb_tuned = XGBClassifier(random_state=42, eval_metric='mlogloss')

# defining the hyperparameter grid for tuning
parameters = {
    "n_estimators": [10,30,50],
    "subsample":[0.7,0.9,1],
    "learning_rate":[0.05, 0.1,0.2],
    "colsample_bytree":[0.7,0.9,1],
    "colsample_bylevel":[0.5,0.7,1]
}

# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')

# running the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_word2vec, y_train_xgb_word2vec)
In [169]:
# Creating a new model with the best combination of parameters
xgb_word2vec_tuned = grid_obj.best_estimator_

# Fit the new model to the data
xgb_word2vec_tuned.fit(X_train_word2vec, y_train_xgb_word2vec)
Out[169]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=1, colsample_bynode=None, colsample_bytree=0.7,
              device=None, early_stopping_rounds=None, enable_categorical=False,
              eval_metric='mlogloss', feature_types=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=30, n_jobs=None,
              num_parallel_tree=None, objective='multi:softprob', ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [170]:
confusion_matrix_sklearn(xgb_word2vec_tuned, X_train_word2vec, y_train_xgb_word2vec)

Observations:

  • Around 4.7% predicted values dont match with true values.
In [171]:
confusion_matrix_sklearn(xgb_word2vec_tuned, X_valid_word2vec, y_valid_xgb_word2vec)

Observations:

  • For validation dataset, we see that around 37% of the predicted values match the actual values.
In [172]:
# Predicting on train data
y_pred_train_xgb_word2vec_tuned = xgb_word2vec_tuned.predict(X_train_word2vec)

# Predicting on validation data
y_pred_valid_xgb_word2vec_tuned = xgb_word2vec_tuned.predict(X_valid_word2vec)

Classification Report

In [173]:
print(classification_report(y_train_xgb_word2vec, y_pred_train_xgb_word2vec_tuned))
              precision    recall  f1-score   support

           0       1.00      0.99      0.99        79
           1       0.99      1.00      1.00       136
           2       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 95%.
In [174]:
xgb_tuned_word2vec_report = classification_report(y_valid_xgb_word2vec, y_pred_valid_xgb_word2vec_tuned)
print(xgb_tuned_word2vec_report)
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.45      0.76      0.57        17
           2       0.50      0.12      0.20         8

    accuracy                           0.40        35
   macro avg       0.32      0.30      0.26        35
weighted avg       0.33      0.40      0.32        35

Observations:

  • We got a weighted recall score of 37% which indicates that the model was overfitting.
    • Negative sentiment recall score is 0%
    • Neutral sentiment recall score is 76%
    • Positive sentiment recall score is 0%

XGBoost (with hyperparametertuning and GloVe)

In [175]:
# Choose the type of classifier
xgb_tuned = XGBClassifier(random_state=42, eval_metric='mlogloss')

# defining the hyperparameter grid for tuning
parameters = {
    "n_estimators": [10,30,50],
    "subsample":[0.7,0.9,1],
    "learning_rate":[0.05, 0.1,0.2],
    "colsample_bytree":[0.7,0.9,1],
    "colsample_bylevel":[0.5,0.7,1]
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')

# running the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_glove, y_train_xgb_glove)
In [176]:
# Creating a new model with the best combination of parameters
xgb_glove_tuned = grid_obj.best_estimator_

# Fit the new model to the data
xgb_glove_tuned.fit(X_train_glove, y_train_xgb_glove)
Out[176]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=0.5, colsample_bynode=None,
              colsample_bytree=0.7, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='mlogloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=30,
              n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Confusion Matrix

In [177]:
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_glove_tuned, X_train_glove, y_train_xgb_glove)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [178]:
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_glove_tuned, X_valid_glove, y_valid_xgb_glove)

Observations:

  • For validation dataset, we see that around 46% of the predicted values match the actual values.
In [179]:
# Predicting on train data
y_pred_train_xgb_glove_tuned = xgb_glove_tuned.predict(X_train_glove)

# Predicting on validation data
y_pred_valid_xgb_glove_tuned = xgb_glove_tuned.predict(X_valid_glove)

Classification Report

In [180]:
print(classification_report(y_train_xgb_glove, y_pred_train_xgb_glove_tuned))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        79
           1       1.00      1.00      1.00       136
           2       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [181]:
xgb_tuned_glove_report = classification_report(y_valid_xgb_glove, y_pred_valid_xgb_glove_tuned)
print(xgb_tuned_glove_report)
              precision    recall  f1-score   support

           0       0.33      0.30      0.32        10
           1       0.52      0.71      0.60        17
           2       0.33      0.12      0.18         8

    accuracy                           0.46        35
   macro avg       0.40      0.38      0.37        35
weighted avg       0.42      0.46      0.42        35

Observations:

  • We got a weighted recall score of 46% which indicates that the model was overfitting.
    • Negative sentiment recall score is 30%
    • Neutral sentiment recall score is 71%
    • Positive sentiment recall score is 12%

XGBoost(with Hyperparameter Tuning and Sentence Transformer)

In [182]:
# Choose the type of classifier
xgb_tuned = XGBClassifier(random_state=42, eval_metric='mlogloss')

# defining the hyperparameter grid for tuning
parameters = {
    "n_estimators": [10,30,50],
    "subsample":[0.7,0.9,1],
    "learning_rate":[0.05, 0.1,0.2],
    "colsample_bytree":[0.7,0.9,1],
    "colsample_bylevel":[0.5,0.7,1]
}
# defining the type of scoring used to compare parameter combinations
# we need to specify the mechanism of averaging as we have more than 2 target classes
scorer = metrics.make_scorer(metrics.recall_score, average='weighted')

# running the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer, cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_sent_transformer, y_train_xgb_sent_transformer)
In [183]:
# Creating a new model with the best combination of parameters
xgb_sent_tuned = grid_obj.best_estimator_

# Fit the new model to the data
xgb_sent_tuned.fit(X_train_sent_transformer, y_train_xgb_sent_transformer)
Out[183]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=0.5, colsample_bynode=None,
              colsample_bytree=0.9, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='mlogloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=10,
              n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [184]:
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_sent_tuned, X_train_sent_transformer, y_train_xgb_sent_transformer)

Observations:

  • We see that all predicted values matches the true/actual values and it looks like that the data is overfitting.
In [185]:
#Printing the confusion matrix
confusion_matrix_sklearn(xgb_sent_tuned, X_valid_sent_transformer, y_valid_xgb_sent_transformer)

Observations:

  • For validation dataset, we see that around 34% of the predicted values match the actual values.
In [186]:
# Predicting on train data
y_pred_train_xgb_sent_tuned = xgb_sent_tuned.predict(X_train_sent_transformer)

# Predicting on validation data
y_pred_valid_xgb_sent_tuned = xgb_sent_tuned.predict(X_valid_sent_transformer)

Classification Report

In [187]:
print(classification_report(y_train_xgb_sent_transformer, y_pred_train_xgb_sent_tuned))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        79
           1       1.00      1.00      1.00       136
           2       1.00      1.00      1.00        64

    accuracy                           1.00       279
   macro avg       1.00      1.00      1.00       279
weighted avg       1.00      1.00      1.00       279

Observations:

  • For training data, We have a Recall score of 100% which indicates that the model is overfitting.
In [188]:
xgb_tuned_sent_report = classification_report(y_valid_xgb_sent_transformer, y_pred_valid_xgb_sent_tuned)
print(xgb_tuned_sent_report)
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.39      0.65      0.49        17
           2       0.20      0.12      0.15         8

    accuracy                           0.34        35
   macro avg       0.20      0.26      0.21        35
weighted avg       0.24      0.34      0.27        35

Observations:

  • We got a weighted recall score of 34% which indicates that the model was overfitting.
    • Negative sentiment recall score is 0%
    • Neutral sentiment recall score is 65%
    • Positive sentiment recall score is 12%

Models Performance Summary

In [189]:
# Summarize all the reports
def summarize_reports(reports, model_type):
    """Summarizes model reports in a structured format.

    Args:
        reports (dict): A dictionary of reports where keys are report names
                       and values are the actual report objects.
        model_type (str): The type of model (e.g., "Random Forest", "XGBoost").

    Returns:
        None (prints the summary to the console)
    """
    print(f"-----------------{model_type.upper()} MODELS---------------------- ")
    for report_name, report in reports.items():
        print(f"\n{report_name.replace('_', ' ').title()} Report:")
        print(report)
    print("-" * 50)

# Define dictionaries to hold your reports
random_forest_reports = {
    "default_word2vec": default_word2vec_report,
    "weighted_word2vec": weighted_word2vec_report,
    "tuned_word2vec": tuned_word2vec_report,
    "default_glove": default_glove_report,
    "weighted_glove": weighted_glove_report,
    "tuned_glove": tuned_glove_report,
    "default_sentence_transformer": default_sent_report,
    "weighted_sentence_transformer": weighted_sent_report,
    "tuned_sentence_transformer": tuned_sent_report,
}

xgboost_reports = {
    "default_word2vec": default_xgb_word2vec_report,
    "tuned_word2vec": xgb_tuned_word2vec_report,
    "default_glove": default_xgb_glove_report,
    "tuned_glove": xgb_tuned_glove_report,
    "default_sentence_transformer": default_xgb_sent_report,
    "tuned_sentence_transformer": xgb_tuned_sent_report,
}

# Print the summarized reports
print("Metrics summary of all the models")
print("-" * 50)

summarize_reports(random_forest_reports, "Random Forest")
summarize_reports(xgboost_reports, "XGBoost")
Metrics summary of all the models
--------------------------------------------------
-----------------RANDOM FOREST MODELS---------------------- 

Default Word2Vec Report:
              precision    recall  f1-score   support

          -1       0.20      0.10      0.13        10
           0       0.43      0.71      0.53        17
           1       0.50      0.12      0.20         8

    accuracy                           0.40        35
   macro avg       0.38      0.31      0.29        35
weighted avg       0.38      0.40      0.34        35


Weighted Word2Vec Report:
              precision    recall  f1-score   support

          -1       0.25      0.10      0.14        10
           0       0.45      0.76      0.57        17
           1       0.50      0.12      0.20         8

    accuracy                           0.43        35
   macro avg       0.40      0.33      0.30        35
weighted avg       0.40      0.43      0.36        35


Tuned Word2Vec Report:
              precision    recall  f1-score   support

          -1       0.14      0.10      0.12        10
           0       0.44      0.65      0.52        17
           1       0.67      0.25      0.36         8

    accuracy                           0.40        35
   macro avg       0.42      0.33      0.34        35
weighted avg       0.41      0.40      0.37        35


Default Glove Report:
              precision    recall  f1-score   support

          -1       0.33      0.20      0.25        10
           0       0.48      0.76      0.59        17
           1       0.50      0.12      0.20         8

    accuracy                           0.46        35
   macro avg       0.44      0.36      0.35        35
weighted avg       0.44      0.46      0.40        35


Weighted Glove Report:
              precision    recall  f1-score   support

          -1       0.14      0.10      0.12        10
           0       0.46      0.76      0.58        17
           1       1.00      0.00      0.00         8

    accuracy                           0.40        35
   macro avg       0.54      0.29      0.23        35
weighted avg       0.49      0.40      0.31        35


Tuned Glove Report:
              precision    recall  f1-score   support

          -1       0.17      0.10      0.12        10
           0       0.41      0.65      0.50        17
           1       0.50      0.12      0.20         8

    accuracy                           0.37        35
   macro avg       0.36      0.29      0.27        35
weighted avg       0.36      0.37      0.32        35


Default Sentence Transformer Report:
              precision    recall  f1-score   support

          -1       0.25      0.10      0.14        10
           0       0.45      0.76      0.57        17
           1       0.50      0.12      0.20         8

    accuracy                           0.43        35
   macro avg       0.40      0.33      0.30        35
weighted avg       0.40      0.43      0.36        35


Weighted Sentence Transformer Report:
              precision    recall  f1-score   support

          -1       0.25      0.10      0.14        10
           0       0.47      0.82      0.60        17
           1       1.00      0.12      0.22         8

    accuracy                           0.46        35
   macro avg       0.57      0.35      0.32        35
weighted avg       0.53      0.46      0.38        35


Tuned Sentence Transformer Report:
              precision    recall  f1-score   support

          -1       0.25      0.20      0.22        10
           0       0.46      0.65      0.54        17
           1       0.67      0.25      0.36         8

    accuracy                           0.43        35
   macro avg       0.46      0.37      0.37        35
weighted avg       0.45      0.43      0.41        35

--------------------------------------------------
-----------------XGBOOST MODELS---------------------- 

Default Word2Vec Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.41      0.65      0.50        17
           2       0.00      0.00      0.00         8

    accuracy                           0.31        35
   macro avg       0.14      0.22      0.17        35
weighted avg       0.20      0.31      0.24        35


Tuned Word2Vec Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.45      0.76      0.57        17
           2       0.50      0.12      0.20         8

    accuracy                           0.40        35
   macro avg       0.32      0.30      0.26        35
weighted avg       0.33      0.40      0.32        35


Default Glove Report:
              precision    recall  f1-score   support

           0       0.22      0.20      0.21        10
           1       0.46      0.65      0.54        17
           2       0.50      0.12      0.20         8

    accuracy                           0.40        35
   macro avg       0.39      0.32      0.32        35
weighted avg       0.40      0.40      0.37        35


Tuned Glove Report:
              precision    recall  f1-score   support

           0       0.33      0.30      0.32        10
           1       0.52      0.71      0.60        17
           2       0.33      0.12      0.18         8

    accuracy                           0.46        35
   macro avg       0.40      0.38      0.37        35
weighted avg       0.42      0.46      0.42        35


Default Sentence Transformer Report:
              precision    recall  f1-score   support

           0       0.43      0.30      0.35        10
           1       0.45      0.59      0.51        17
           2       0.50      0.38      0.43         8

    accuracy                           0.46        35
   macro avg       0.46      0.42      0.43        35
weighted avg       0.46      0.46      0.45        35


Tuned Sentence Transformer Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.39      0.65      0.49        17
           2       0.20      0.12      0.15         8

    accuracy                           0.34        35
   macro avg       0.20      0.26      0.21        35
weighted avg       0.24      0.34      0.27        35

--------------------------------------------------

Model Performance Summary :

  • We have built 15 models( 9 Random Forest models and 6 XGBoost Model) using Word2vec, Glove and Sentence Transformer embedding. Our metric of interest is Recall.
  • We observe that there are top 4 models with the highest recall score of 46%. Below are the models

    • Random Forest - Default Glove Model
    • Random Forest - Weighted Sentence Transformer Model
    • XGBoost - Tuned Glove Model
    • XGBoost - Default Sentence Transformer Model
    • Lets dive further to look at other factors to choose the right model

      • Class-wise Recall :

        • Random Forest - Default Glove Model
          • Negative sentiment recall score is 20%
          • Neutral sentiment recall score is 76%
          • Positive sentiment recall score is 12%
        • Random Forest - Weighted Sentence Transformer Model
          • Negative sentiment recall score is 10%
          • Neutral sentiment recall score is 82%
          • Positive sentiment recall score is 12%
        • XGBoost - Tuned Glove Model
          • Negative sentiment recall score is 30%
          • Neutral sentiment recall score is 71%
          • Positive sentiment recall score is 12%
        • XGBoost - Default Sentence Transformer Model

          • Negative sentiment recall score is 30%
          • Neutral sentiment recall score is 59%
          • Positive sentiment recall score is 38%

          By Observation, we see XGBoost - Default Sentence Transformer Model class Wise recall score for Negative and positive sentiments are better than other models.

      • Computational Efficiency :

        • Default Models - need the least computational resources
        • Weighted Models - Second least resources after the default Models
        • Hyper parameter tuning Models- needs more computational resources

        • For Sentence Transformers, we do not have to pre-process or clean the data, we can feed the input as is which saves computational resource

          By Observation, we see that XGBoost - Default Sentence Transformer Model requires less computational resource

      • Other Metrics like F1-Score :

        • The Second critical métricas is F1-Score which keeps a balance between Precision and Recall.
        • Random Forest - Default Glove Model
          • F1- Score is 40%
        • Random Forest - Weighted Sentence Transformer Model
          • F1-Score is 38%
        • XGBoost - Tuned Glove Model
          • F1-Score is 42%
        • XGBoost - Default Sentence Transformer Model

          • F1-Score is 45%

          By observation, we see that XGBoost - Default Sentence Transformer Model has the highest F1-Score of 45%

Final Model: Based on all the criteria above, XGBoost - Default Sentence Transformer Model is the best model.

Final Model Performance on Test Data

In [190]:
# Predicting on validation data
y_pred_test_xgb_sent_transformer = xgb_sent_transformer.predict(X_test_sent_transformer)

Confusion Matrix

In [192]:
#XGBoost doesnt take negative values/classes in y. So map the classes to positive number
y_test_xgb_sent_transformer = y_test_sent_transformer.map({-1: 0, 0: 1, 1: 2})  # Change -1 to 0, 0 to 1, and 1 to 2
In [193]:
confusion_matrix_sklearn(xgb_sent_transformer, X_test_sent_transformer, y_test_xgb_sent_transformer)

Classification Report

In [194]:
#included zero_division=1 to address the warning related to precision values
final_model_report = classification_report(y_test_xgb_sent_transformer, y_pred_test_xgb_sent_transformer, zero_division=1)
print(final_model_report)
              precision    recall  f1-score   support

           0       0.50      0.40      0.44        10
           1       0.48      0.59      0.53        17
           2       0.33      0.25      0.29         8

    accuracy                           0.46        35
   macro avg       0.44      0.41      0.42        35
weighted avg       0.45      0.46      0.45        35

Final Model Summary : (XGBoost - Default Sentence Transformer)

  • Weighted Recall score on test data - 46%
  • Class-wise Recall Scores
    • Negative Sentiment - 40%
    • Neutral Sentiment - 59%
    • Positive Sentiment - 25%
  • F1-Score
    • F1-Score is 45%

Conclusion : Model has generalized well and has given the similar performance as validation dataset.

Weekly News Summarization

Important Note: It is recommended to run this section of the project independently from the previous sections in order to avoid runtime crashes due to RAM overload.

Installing and Importing the necessary libraries

In [1]:
!pip install git+https://github.com/abetlen/llama-cpp-python.git
Collecting git+https://github.com/abetlen/llama-cpp-python.git
  Cloning https://github.com/abetlen/llama-cpp-python.git to /tmp/pip-req-build-upp4trzd
  Running command git clone --filter=blob:none --quiet https://github.com/abetlen/llama-cpp-python.git /tmp/pip-req-build-upp4trzd
  Resolved https://github.com/abetlen/llama-cpp-python.git to commit b1d23df0bbd327b774083b5cf88e67ca0dd52b92
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.11/dist-packages (from llama_cpp_python==0.3.9) (4.13.2)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.11/dist-packages (from llama_cpp_python==0.3.9) (2.0.2)
Collecting diskcache>=5.6.1 (from llama_cpp_python==0.3.9)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: jinja2>=2.11.3 in /usr/local/lib/python3.11/dist-packages (from llama_cpp_python==0.3.9) (3.1.6)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2>=2.11.3->llama_cpp_python==0.3.9) (3.0.2)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 2.0 MB/s eta 0:00:00
Building wheels for collected packages: llama_cpp_python
  Building wheel for llama_cpp_python (pyproject.toml) ... done
  Created wheel for llama_cpp_python: filename=llama_cpp_python-0.3.9-cp311-cp311-linux_x86_64.whl size=4066894 sha256=144295e26a9a886092819ad3f1f388551fbaa2078de3ae0694b3bbee656212eb
  Stored in directory: /tmp/pip-ephem-wheel-cache-zgyheaus/wheels/01/08/fb/81c44fda474774fbb40f7b407f3d53c6554c77fc88cd8774ac
Successfully built llama_cpp_python
Installing collected packages: diskcache, llama_cpp_python
Successfully installed diskcache-5.6.3 llama_cpp_python-0.3.9
In [ ]:
# Installation for GPU llama-cpp-python
# uncomment and run the following code in case GPU is being used
#!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 -q

# Installation for CPU llama-cpp-python
# uncomment and run the following code in case GPU is not being used
#!CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python -q
In [2]:
# Function to download the model from the Hugging Face model hub
from huggingface_hub import hf_hub_download

# Importing the Llama class from the llama_cpp module
from llama_cpp import Llama

# Importing the library for data manipulation
import pandas as pd

from tqdm import tqdm # For progress bar related functionalities
tqdm.pandas()

# to ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

Loading the data

In [3]:
# mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
# loading the dataset
df_summarization = pd.read_csv('/content/drive/My Drive/Colab Notebooks/NLP/Project/stock_news.csv')
In [5]:
data_summarization = df_summarization.copy()

Loading the model

In [26]:
import torch
from llama_cpp import Llama

# Check if CUDA is available
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
In [27]:
print(device)
cuda
In [6]:
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"
In [7]:
# Using hf_hub_download to download a model from the Hugging Face model hub
# The repo_id parameter specifies the model name or path in the Hugging Face repository
# The filename parameter specifies the name of the file to download
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)
In [33]:
llm = Llama(
    model_path=model_path,
    n_threads=2,  # CPU cores
    n_batch=512,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=43,  # Change this value based on your model and your GPU VRAM pool.
    n_ctx=5500,  # Context window
)
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 18
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q6_K:  226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 5.53 GiB (6.56 BPW) 
init_tokenizer: initializing tokenizer for type 1
load: control token:      2 '</s>' is not marked as EOG
load: control token:      1 '<s>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.24 B
print_info: general.name     = mistralai_mistral-7b-instruct-v0.2
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: layer  29 assigned to device CPU, is_swa = 0
load_tensors: layer  30 assigned to device CPU, is_swa = 0
load_tensors: layer  31 assigned to device CPU, is_swa = 0
load_tensors: layer  32 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q6_K) (and 290 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors:   CPU_Mapped model buffer size =  5666.09 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 5500
llama_context: n_ctx_per_seq = 5500
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (5500) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.12 MiB
create_memory: n_ctx = 5504 (padded)
llama_kv_cache_unified: kv_size = 5504, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified: layer  26: dev = CPU
llama_kv_cache_unified: layer  27: dev = CPU
llama_kv_cache_unified: layer  28: dev = CPU
llama_kv_cache_unified: layer  29: dev = CPU
llama_kv_cache_unified: layer  30: dev = CPU
llama_kv_cache_unified: layer  31: dev = CPU
llama_kv_cache_unified:        CPU KV buffer size =   688.00 MiB
llama_kv_cache_unified: KV self size  =  688.00 MiB, K (f16):  344.00 MiB, V (f16):  344.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:        CPU compute buffer size =   386.76 MiB
llama_context: graph nodes  = 1094
llama_context: graph splits = 1
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-instruct-v0.2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '18'}
Available chat formats from metadata: chat_template.default
Guessed chat format: mistral-instruct

Aggregating the data weekly

In [9]:
data_summarization["Date"] = pd.to_datetime(data_summarization['Date'])  # Convert the 'Date' column to datetime format.
In [10]:
# Group the data by week using the 'Date' column.
weekly_grouped = data_summarization.groupby(pd.Grouper(key='Date', freq='W'))
In [11]:
weekly_grouped = weekly_grouped.agg(
    {
        'News': lambda x: ' || '.join(x)  # Join the news values with ' || ' separator.
    }
).reset_index()

print(weekly_grouped.shape)
(18, 2)
In [12]:
weekly_grouped
Out[12]:
Date News
0 2019-01-06 The tech sector experienced a significant dec...
1 2019-01-13 Sprint and Samsung plan to release 5G smartph...
2 2019-01-20 The U.S. stock market declined on Monday as c...
3 2019-01-27 The Swiss National Bank (SNB) governor, Andre...
4 2019-02-03 Caterpillar Inc reported lower-than-expected ...
5 2019-02-10 The Dow Jones Industrial Average, S&P 500, an...
6 2019-02-17 This week, the European Union's second highes...
7 2019-02-24 This news article discusses progress towards ...
8 2019-03-03 The Dow Jones Industrial Average and other ma...
9 2019-03-10 Spotify, the world's largest paid music strea...
10 2019-03-17 The United States opposes France's digital se...
11 2019-03-24 Facebook's stock price dropped more than 3% o...
12 2019-03-31 This news article reports that the S&P 500 In...
13 2019-04-07 Apple and other consumer brands, including LV...
14 2019-04-14 In March, mobile phone shipments to China dro...
15 2019-04-21 The chairman of Taiwan's Foxconn, Terry Gou, ...
16 2019-04-28 Taiwan's export orders continued to decline f...
17 2019-05-05 Spotify reported better-than-expected Q1 reve...
In [13]:
# creating a copy of the data
data_1 = weekly_grouped.copy()

Summarization

Note:

  • The model is expected to summarize the news from the week by identifying the top three positive and negative events that are most likely to impact the price of the stock.

  • As an output, the model is expected to return a JSON containing two keys, one for Positive Events and one for Negative Events.

For the project, we need to define the prompt to be fed to the LLM to help it understand the task to perform. The following should be the components of the prompt:

  1. Role: Specifies the role the LLM will be taking up to perform the specified task, along with any specific details regarding the role

    • Example: You are an expert data analyst specializing in news content analysis.
  2. Task: Specifies the task to be performed and outlines what needs to be accomplished, clearly defining the objective

    • Example: Analyze the provided news headline and return the main topics contained within it.
  3. Instructions: Provides detailed guidelines on how to perform the task, which includes steps, rules, and criteria to ensure the task is executed correctly

    • Example:
Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.
  1. Output Format: Specifies the format in which the final response should be structured, ensuring consistency and clarity in the generated output

    • Example: Return the output in JSON format with keys as the topic number and values as the actual topic.

Full Prompt Example:

You are an expert data analyst specializing in news content analysis.

Task: Analyze the provided news headline and return the main topics contained within it.

Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.

Return the output in JSON format with keys as the topic number and values as the actual topic.

Sample Output:

{"1": "Politics", "2": "Economy", "3": "Health" }

Note:

  • The model is expected to summarize the news from the week by identifying the top three positive and negative events that are most likely to impact the price of the stock.

  • As an output, the model is expected to return a JSON containing two keys, one for Positive Events and one for Negative Events.

For the project, we need to define the prompt to be fed to the LLM to help it understand the task to perform. The following should be the components of the prompt:

  1. Role: Specifies the role the LLM will be taking up to perform the specified task, along with any specific details regarding the role

    • Example: You are an expert data analyst specializing in news content analysis.
  2. Task: Specifies the task to be performed and outlines what needs to be accomplished, clearly defining the objective

    • Example: Analyze the provided news headline and return the main topics contained within it.
  3. Instructions: Provides detailed guidelines on how to perform the task, which includes steps, rules, and criteria to ensure the task is executed correctly

    • Example:
Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.
  1. Output Format: Specifies the format in which the final response should be structured, ensuring consistency and clarity in the generated output

    • Example: Return the output in JSON format with keys as the topic number and values as the actual topic.

Full Prompt Example:

You are an expert data analyst specializing in news content analysis.

Task: Analyze the provided news headline and return the main topics contained within it.

Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.

Return the output in JSON format with keys as the topic number and values as the actual topic.

Sample Output:

{"1": "Politics", "2": "Economy", "3": "Health" }

Utility Functions
In [14]:
# defining a function to parse the JSON output from the model
def extract_json_data(json_str):
    import json
    try:
        # Find the indices of the opening and closing curly braces
        json_start = json_str.find('{')
        json_end = json_str.rfind('}')

        if json_start != -1 and json_end != -1:
            extracted_category = json_str[json_start:json_end + 1]  # Extract the JSON object
            data_dict = json.loads(extracted_category)
            return data_dict
        else:
            print(f"Warning: JSON object not found in response: {json_str}")
            return {}
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return {}
In [17]:
import nltk

# Download the 'punkt_tab' resource
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize

token_counts = [len(word_tokenize(text)) for text in data_1['News']]
max_tokens = max(token_counts)
print(max_tokens)
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
2902
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
Defining the response function
In [45]:
Instruction_1 = """
Role: You are an expert financial analyst specializing in market sentiment analysis and news summarization. Your primary role is to analyze weekly news articles related to a specific company and determine the top three positive and negative events that are most likely to affect its stock price.

Task: Analyze the provided news articles for the past week and identify the top three positive and negative events. These events should be the most significant occurrences reported in the news that could potentially influence the company's stock price. Summarize these events concisely and objectively.

Instructions:

1. Carefully read each news article provided for the specified week.
2. Extract key events or topics discussed in the articles.
3. Categorize the events as positive or negative based on their potential impact on the company's stock price. For example, a new product launch would generally be considered a positive event, while a product recall would be considered a negative event.
4. Rank the positive and negative events based on their significance and potential impact.
5. Select the top three most impactful positive events and the top three most impactful negative events.
6. Summarize each selected event in a clear and concise manner, avoiding subjective opinions or interpretations. Focus on factual reporting and avoid speculation.
7.Present the summarized events in a JSON format with two keys: "Positive Events" and "Negative Events." Each key should contain a list of the three summarized events in order of impact.

Example JSON Output:
{
  "Positive Events": [
    "Company announced a strategic partnership with a major industry player, potentially expanding its market reach.",
    "Positive earnings report exceeding analysts' expectations, indicating strong financial performance.",
    "New product launch receiving positive reviews and generating significant customer interest."
  ],
  "Negative Events": [
    "Product recall due to safety concerns, impacting sales and brand reputation.",
    "Regulatory investigation initiated against the company, potentially leading to fines or penalties.",
    "Key executive unexpectedly resigned, raising concerns about leadership stability."
  ]
}
"""
In [46]:
#length of the instructions
len(Instruction_1)
Out[46]:
2228
In [47]:
#Defining the response function
def response_mistral(prompt, news):
    model_output = llm(
      f"""
      [INST]
      {prompt}
      News Articles: {news}
      [/INST]
      """,
      max_tokens=5500, #Complete the code to set the maximum number of tokens the model should generate for this task.
      temperature=0, #Complete the code to set the value for temperature.
      top_p=0.95, #Complete the code to set the value for top_p
      top_k=50, #Complete the code to set the value for top_k
      echo=False,
    )

    final_output = model_output["choices"][0]["text"]

    return final_output
Checking the model output on a sample

Note: Use this section to test out the prompt with one instance before using it for the entire weekly data.

In [48]:
#prompt with one instance
#use iloc to get one week
#create dataframe to store the output
test_data = response_mistral(Instruction_1, data_1['News'][0])
Llama.generate: 342 prefix-match hit, remaining 3877 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  892804.63 ms /  6949 tokens (  128.48 ms per token,     7.78 tokens per second)
llama_perf_context_print:        eval time =  167820.69 ms /   345 runs   (  486.44 ms per token,     2.06 tokens per second)
llama_perf_context_print:       total time =  657331.89 ms /  7294 tokens
In [36]:
print(test_data)
 {
          "Positive Events": [
            "Roku Inc announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel.",
            "FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system.",
            "Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia."
          ],
          "Negative Events": [
            "Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue.",
            "Apple's profit warning led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple.",
            "Delta Air Lines reported lower-than-expected fourth quarter unit revenue growth, citing weaker than anticipated late bookings and increased competition."
          ]
        }

        This JSON output summarizes the top three positive and negative events reported in the news articles for the specified week. The positive events include Roku's announcement of premium video channels, the FDIC Chair's reassuring statement about market volatility, and the rebound of oil prices. The negative events include Apple's revenue warning and its impact on Berkshire Hathaway, and Delta Air Lines' lower-than-expected unit revenue growth.
In [49]:
import pandas as pd
from IPython.display import display

# Assuming data_1 is your DataFrame and 'News' is the column containing news articles

# Set display options to show all text
pd.set_option("display.max_colwidth", None)  # Display full column width
In [50]:
# Display the entire text of the first news article without truncation
display(data_1['News'][0])
' The tech sector experienced a significant decline in the aftermarket following Apple\'s Q1 revenue warning. Notable suppliers, including Skyworks, Broadcom, Lumentum, Qorvo, and TSMC, saw their stocks drop in response to Apple\'s downward revision of its revenue expectations for the quarter, previously announced in January. ||  Apple lowered its fiscal Q1 revenue guidance to $84 billion from earlier estimates of $89-$93 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple\'s stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 10 ||  Apple cut its fiscal first quarter revenue forecast from $89-$93 billion to $84 billion due to weaker demand in China and fewer iPhone upgrades. CEO Tim Cook also mentioned constrained sales of Airpods and Macbooks. Apple\'s shares fell 8.5% in post market trading, while Asian suppliers like Hon ||  This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple ||  Apple\'s revenue warning led to a decline in USD JPY pair and a gain in Japanese yen, as investors sought safety in the highly liquid currency. Apple\'s underperformance in Q1, with forecasted revenue of $84 billion compared to analyst expectations of $91.5 billion, triggered risk aversion mood in markets || Apple CEO Tim Cook discussed the company\'s Q1 warning on CNBC, attributing US-China trade tensions as a factor. Despite not mentioning iPhone unit sales specifically, Cook indicated Apple may comment on them again. Services revenue is projected to exceed $10.8 billion in Q1. Cook also addressed the lack of ||  Roku Inc has announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel. Partners include CBS Corp\'s Showtime, Lionsgate\'s Starz, and Viacom Inc\'s Noggin. This model follows Amazon\'s successful Channels business, which generated an estimated ||  Wall Street saw modest gains on Wednesday but were threatened by fears of a global economic slowdown following Apple\'s shocking revenue forecast cut, blaming weak demand in China. The tech giant\'s suppliers and S&P 500 futures also suffered losses. Reports of decelerating factory activity in China and the euro zone ||  Apple\'s fiscal first quarter revenue came in below analysts\' estimates at around $84 billion, a significant drop from the forecasted range of $89-$93 billion. The tech giant attributed the shortfall to lower iPhone revenue and upgrades, as well as weakness in emerging markets. Several brokerages had already reduced their production estimates ||  Apple Inc. lowered its quarterly sales forecast for the fiscal first quarter, underperforming analysts\' expectations due to slowing Chinese economy and trade tensions. The news sent Apple shares tumbling and affected Asia-listed suppliers like Hon Hai Precision Industry Co Ltd, Taiwan Semiconductor Manufacturing Company, and LG Innot ||  The Australian dollar experienced significant volatility on Thursday, plunging to multi-year lows against major currencies due to automated selling, liquidity issues, and a drought of trades. The largest intra-day falls in the Aussie\'s history occurred amid violent movements in AUD/JPY and AUD/ ||  In early Asian trading on Thursday, the Japanese yen surged as the U.S. dollar and Australian dollar collapsed in thin markets due to massive stop loss sales triggered by Apple\'s earnings warning of sluggish iPhone sales in China and risk aversion. The yen reached its lowest levels against the U.S. dollar since March  ||  The dollar fell from above 109 to 106.67 after Apple\'s revenue warning, while the 10-year Treasury yield also dropped to 2.61%. This followed money flowing into US government paper. Apple\'s shares and U.S. stock index futures declined, with the NAS ||  RBC Capital maintains its bullish stance on Apple, keeping its Outperform rating and $220 price target. However, analyst Amit Daryanani warns of ongoing iPhone demand concerns, which could impact pricing power and segmentation efforts if severe. He suggests potential capital allocation adjustments if the stock underperforms for several quarters ||  Oil prices dropped on Thursday as investor sentiment remained affected by China\'s economic slowdown and turmoil in stock and currency markets. US WTI Crude Oil fell by $2.10 to $45.56 a barrel, while International Brent Oil was down $1.20 at $54.26 ||  In this news article, investors\' concerns about a slowing Chinese and global economy, amplified by Apple\'s revenue warning, led to a significant surge in the Japanese yen. The yen reached its biggest one-day rise in 20 months, with gains of over 4% versus the dollar. This trend was driven by automated ||  In Asia, gold prices rose to over six-month highs on concerns of a global economic slowdown and stock market volatility. Apple lowered its revenue forecast for the first quarter, leading Asian stocks to decline and safe haven assets like gold and Japanese yen to gain. Data showed weakened factory activity in Asia, particularly China, adding to ||  Fears of a global economic slowdown led to a decline in the US dollar on Thursday, as the yen gained ground due to its status as a safe haven currency. The USD index slipped below 96, and USD JPY dropped to 107.61, while the yen strengthened by 4.4%. ||  In Thursday trading, long-term US Treasury yields dropped significantly below 2.6%, reaching levels not seen in over a year, as investors shifted funds from stocks to bonds following Apple\'s warning of decreased revenue due to emerging markets and China\'s impact on corporate profits, with the White House advisor adding to concerns of earnings down ||  Gold prices have reached their highest level since mid-June, with the yellow metal hitting $1,291.40 per ounce due to investor concerns over a slowing economy and Apple\'s bearish revenue outlook. Saxo Bank analyst Ole Hansen predicts gold may reach $1,300 sooner ||  Wedbush analyst Daniel Ives lowered his price target for Apple from $275 to $200 due to concerns over potential iPhone sales stagnation, with an estimated 750 million active iPhones worldwide that could cease growing or even decline. He maintains an Outperform rating and remains bullish on the long ||  Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia. The Organization of the Petroleum Exporting Countries (OPEC) led by Saudi Arabia and other producers ||  This news article reports on the impact of Apple\'s Q1 revenue warning on several tech and biotech stocks. Sesen Bio (SESN) and Prana Biotechnology (PRAN) saw their stock prices drop by 28% and 11%, respectively, following the announcement. Mellanox Technologies (ML ||  Gold prices reached within $5 of $1,300 on Thursday as weak stock markets and a slumping dollar drove investors towards safe-haven assets. The U.S. stock market fell about 2%, with Apple\'s rare profit warning adding to investor unease. COMEX gold futures settled at $1 ||  The FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system due to banks\' ample capital. She also mentioned a review of the CAMELS rating system used to evaluate bank health for potential inconsistencies and concerns regarding forum shopping. This review comes from industry ||  Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple\'s revenue. This marks a significant downturn during Tim Cook\'s tenure and reflects broader economic concerns in China exacerbated by trade tensions with the US. U ||  Goldman analyst Rod Hall lowered his price target for Apple from $182 to $140, citing potential risks to the tech giant\'s 2019 numbers due to uncertainties in Chinese demand. He reduced his revenue estimate for the year by $6 billion and EPS forecast by $1.54 ||  Delta Air Lines lowered its fourth-quarter revenue growth forecast to a range of 3% from the previous estimate of 3% to 5%. Earnings per share are now expected to be $1.25 to $1.30. The slower pace of improvement in late December was unexpected, and Delta cited this as ||  Apple\'s profit warning has significantly impacted the stock market and changed the outlook for interest rates. The chance of a rate cut in May has increased to 15-16% from just 3%, according to Investing com\'s Fed Rate Monitor Tool. There is even a 1% chance of two cuts in May. ||  The White House advisor, Kevin Hassett, stated that a decline in Chinese economic growth would negatively impact U.S. firm profits but recover once a trade deal is reached between Washington and Beijing. He also noted that Asian economies, including China, have been experiencing significant slowdowns since last spring due to U.S. tariffs ||  The White House economic adviser, Kevin Hassett, warned that more companies could face earnings downgrades due to ongoing trade negotiations between the U.S. and China, leading to a decline in oil prices on Thursday. WTI crude fell 44 cents to $44.97 a barrel, while Brent crude inched ||  Japanese stocks suffered significant losses on the first trading day of 2019, with the Nikkei 225 and Topix indices both falling over 3 percent. Apple\'s revenue forecast cut, citing weak iPhone sales in China, triggered global growth concerns and sent technology shares tumbling. The S&P 50 ||  Investors withdrew a record $98 billion from U.S. stock funds in December, with fears of aggressive monetary policy and an economic slowdown driving risk reduction. The S&P 500 fell 9% last month, with some seeing declines as a buying opportunity. Apple\'s warning of weak iPhone sales added ||  Apple\'s Q1 revenue guidance cut, resulting from weaker demand in China, led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple. This news, coupled with broad market declines, caused a significant $21.4 billion decrease in Berk ||  This news article reports that a cybersecurity researcher, Wish Wu, planned to present at the Black Hat Asia hacking conference on how to bypass Apple\'s Face ID biometric security on iPhones. However, his employer, Ant Financial, which operates Alipay and uses facial recognition technologies including Face ID, asked him to withdraw ||  OPEC\'s production cuts faced uncertainty as oil prices were influenced by volatile stock markets, specifically due to Apple\'s lowered revenue forecast and global economic slowdown fears. US WTI and Brent crude both saw gains, but these were checked by stock market declines. Shale production is expected to continue impacting the oil market in ||  Warren Buffett\'s Berkshire Hathaway suffered significant losses in the fourth quarter due to declines in Apple, its largest common stock investment. Apple cut its revenue forecast, causing a 5-6% decrease in Berkshire\'s Class A shares. The decline resulted in potential unrealized investment losses and could push Berk ||  This news article reports that on Thursday, the two-year Treasury note yield dropped below the Federal Reserve\'s effective rate for the first time since 2008. The market move suggests investors believe the Fed will not be able to continue tightening monetary policy. The drop in yields was attributed to a significant decline in U.S ||  The U.S. and China will hold their first face-to-face trade talks since agreeing to a 90-day truce in their trade war last month. Deputy U.S. Trade Representative Jeffrey Gerrish will lead the U.S. delegation for negotiations on Jan. 7 and 8, ||  Investors bought gold in large quantities due to concerns over a global economic slowdown, increased uncertainty in the stock market, and potential Fed rate hikes. The precious metal reached its highest price since June, with gold ETF holdings also seeing significant increases. Factors contributing to this demand include economic downturn, central bank policy mistakes, and ||  Delta Air Lines Inc reported lower-than-expected fourth quarter unit revenue growth, citing weaker than anticipated late bookings and increased competition. The carrier now expects total revenue per available seat mile to rise about 3 percent in the period, down from its earlier forecast of 3.5 percent growth. Fuel prices are also expected to ||  U.S. stocks experienced significant declines on Thursday as the S&P 500 dropped over 2%, the Dow Jones Industrial Average fell nearly 3%, and the Nasdaq Composite lost approximately 3% following a warning of weak revenue from Apple and indications of slowing U.S. factory activity, raising concerns ||  President Trump expressed optimism over potential trade talks with China, citing China\'s current economic weakness as a potential advantage for the US. This sentiment was echoed by recent reports of weakened demand for Apple iPhones in China, raising concerns about the overall health of the Chinese economy. The White House is expected to take a strong stance in ||  Qualcomm secured a court order in Germany banning the sale of some iPhone models due to patent infringement, leading Apple to potentially remove these devices from its stores. However, third-party resellers like Gravis continue selling the affected iPhones. This is the third major effort by Qualcomm to ban Apple\'s iPhones glob ||  Oil prices rose on Friday in Asia as China confirmed trade talks with the U.S., with WTI gaining 0.7% to $47.48 and Brent increasing 0.7% to $56.38 a barrel. The gains came after China\'s Commerce Ministry announced that deputy U.S. Trade ||  Gold prices surged past the psychologically significant level of $1,300 per ounce in Asia on Friday due to growing concerns over a potential global economic downturn. The rise in gold was attributed to weak PMI data from China and Apple\'s reduced quarterly sales forecast. Investors viewed gold as a safe haven asset amidst ||  In an internal memo, Huawei\'s Chen Lifang reprimanded two employees for sending a New Year greeting on the company\'s official Twitter account using an iPhone instead of a Huawei device. The incident caused damage to the brand and was described as a "blunder" in the memo. The mistake occurred due to ||  This news article reports on the positive impact of trade war talks between Beijing and Washington on European stock markets, specifically sectors sensitive to the trade war such as carmakers, industrials, mining companies, and banking. Stocks rallied with mining companies leading the gains due to copper price recovery. Bayer shares climbed despite a potential ruling restricting || Amazon has sold over 100 million devices with its Alexa digital assistant, according to The Verge. The company is cautious about releasing hardware sales figures and did not disclose holiday numbers for the Echo Dot. Over 150 products feature Alexa, and more than 28,000 smart home || The Supreme Court will review Broadcom\'s appeal in a shareholder lawsuit over the 2015 acquisition of Emulex. The case hinges on whether intent to defraud is required for such lawsuits, and the decision could extend beyond the Broadcom suit. An Emulex investor filed a class action lawsuit ||  The Chinese central bank announced a fifth reduction in the required reserve ratio (RRR) for banks, freeing up approximately 116.5 billion yuan for new lending. This follows mounting concerns about China\'s economic health amid slowing domestic demand and U.S. tariffs on exports. Premier Li Keqiang || The stock market rebounded strongly on Friday following positive news about US-China trade talks, a better-than-expected jobs report, and dovish comments from Federal Reserve Chairman Jerome Powell. The Dow Jones Industrial Average rose over 746 points, with the S&P 500 and Nasdaq Com'
Checking the model output on the weekly data
In [51]:
data_1['model_response'] = data_1['News'].progress_apply(lambda x: response_mistral(Instruction_1, x))
  0%|          | 0/18 [00:00<?, ?it/s]Llama.generate: 4218 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =  190980.76 ms /   387 runs   (  493.49 ms per token,     2.03 tokens per second)
llama_perf_context_print:       total time =  191271.81 ms /   388 tokens
 11%|█         | 2/18 [03:11<25:30, 95.65s/it]Llama.generate: 474 prefix-match hit, remaining 2234 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  277458.09 ms /  2234 tokens (  124.20 ms per token,     8.05 tokens per second)
llama_perf_context_print:        eval time =  193923.86 ms /   424 runs   (  457.37 ms per token,     2.19 tokens per second)
llama_perf_context_print:       total time =  471714.38 ms /  2658 tokens
 17%|█▋        | 3/18 [11:03<1:03:05, 252.35s/it]Llama.generate: 474 prefix-match hit, remaining 2089 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  262025.55 ms /  2089 tokens (  125.43 ms per token,     7.97 tokens per second)
llama_perf_context_print:        eval time =  160181.34 ms /   356 runs   (  449.95 ms per token,     2.22 tokens per second)
llama_perf_context_print:       total time =  422464.98 ms /  2445 tokens
 22%|██▏       | 4/18 [18:05<1:13:41, 315.83s/it]Llama.generate: 475 prefix-match hit, remaining 1475 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  177113.54 ms /  1475 tokens (  120.08 ms per token,     8.33 tokens per second)
llama_perf_context_print:        eval time =  206170.35 ms /   467 runs   (  441.48 ms per token,     2.27 tokens per second)
llama_perf_context_print:       total time =  383672.47 ms /  1942 tokens
 28%|██▊       | 5/18 [24:29<1:13:32, 339.42s/it]Llama.generate: 474 prefix-match hit, remaining 2587 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  321982.51 ms /  2587 tokens (  124.46 ms per token,     8.03 tokens per second)
llama_perf_context_print:        eval time =  121079.04 ms /   263 runs   (  460.38 ms per token,     2.17 tokens per second)
llama_perf_context_print:       total time =  443231.84 ms /  2850 tokens
 33%|███▎      | 6/18 [31:52<1:14:46, 373.88s/it]Llama.generate: 474 prefix-match hit, remaining 1083 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  129112.43 ms /  1083 tokens (  119.22 ms per token,     8.39 tokens per second)
llama_perf_context_print:        eval time =  118367.17 ms /   276 runs   (  428.87 ms per token,     2.33 tokens per second)
llama_perf_context_print:       total time =  247663.56 ms /  1359 tokens
 39%|███▉      | 7/18 [36:00<1:01:06, 333.29s/it]Llama.generate: 474 prefix-match hit, remaining 1271 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  151921.79 ms /  1271 tokens (  119.53 ms per token,     8.37 tokens per second)
llama_perf_context_print:        eval time =  238298.58 ms /   548 runs   (  434.85 ms per token,     2.30 tokens per second)
llama_perf_context_print:       total time =  390705.97 ms /  1819 tokens
 44%|████▍     | 8/18 [42:30<58:33, 351.37s/it]  Llama.generate: 475 prefix-match hit, remaining 570 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =   65339.64 ms /   570 tokens (  114.63 ms per token,     8.72 tokens per second)
llama_perf_context_print:        eval time =   73500.23 ms /   177 runs   (  415.26 ms per token,     2.41 tokens per second)
llama_perf_context_print:       total time =  138941.19 ms /   747 tokens
 50%|█████     | 9/18 [44:49<42:49, 285.47s/it]Llama.generate: 474 prefix-match hit, remaining 654 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =   75567.48 ms /   654 tokens (  115.55 ms per token,     8.65 tokens per second)
llama_perf_context_print:        eval time =  214686.30 ms /   511 runs   (  420.13 ms per token,     2.38 tokens per second)
llama_perf_context_print:       total time =  290689.27 ms /  1165 tokens
 56%|█████▌    | 10/18 [49:40<38:16, 287.08s/it]Llama.generate: 474 prefix-match hit, remaining 726 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =   85159.71 ms /   726 tokens (  117.30 ms per token,     8.53 tokens per second)
llama_perf_context_print:        eval time =   94751.85 ms /   225 runs   (  421.12 ms per token,     2.37 tokens per second)
llama_perf_context_print:       total time =  180049.72 ms /   951 tokens
 61%|██████    | 11/18 [52:40<29:41, 254.44s/it]Llama.generate: 474 prefix-match hit, remaining 1138 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  138688.54 ms /  1138 tokens (  121.87 ms per token,     8.21 tokens per second)
llama_perf_context_print:        eval time =  134151.19 ms /   312 runs   (  429.97 ms per token,     2.33 tokens per second)
llama_perf_context_print:       total time =  273053.46 ms /  1450 tokens
 67%|██████▋   | 12/18 [57:13<26:00, 260.09s/it]Llama.generate: 474 prefix-match hit, remaining 1263 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  154136.29 ms /  1263 tokens (  122.04 ms per token,     8.19 tokens per second)
llama_perf_context_print:        eval time =  217570.04 ms /   491 runs   (  443.12 ms per token,     2.26 tokens per second)
llama_perf_context_print:       total time =  372124.41 ms /  1754 tokens
 72%|███████▏  | 13/18 [1:03:25<24:29, 293.97s/it]Llama.generate: 474 prefix-match hit, remaining 1656 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  208779.21 ms /  1656 tokens (  126.07 ms per token,     7.93 tokens per second)
llama_perf_context_print:        eval time =  204551.65 ms /   453 runs   (  451.55 ms per token,     2.21 tokens per second)
llama_perf_context_print:       total time =  413705.23 ms /  2109 tokens
 78%|███████▊  | 14/18 [1:10:19<22:00, 330.10s/it]Llama.generate: 474 prefix-match hit, remaining 870 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  104968.74 ms /   870 tokens (  120.65 ms per token,     8.29 tokens per second)
llama_perf_context_print:        eval time =  102626.81 ms /   238 runs   (  431.21 ms per token,     2.32 tokens per second)
llama_perf_context_print:       total time =  207746.97 ms /  1108 tokens
 83%|████████▎ | 15/18 [1:13:47<14:39, 293.25s/it]Llama.generate: 474 prefix-match hit, remaining 507 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =   61836.90 ms /   507 tokens (  121.97 ms per token,     8.20 tokens per second)
llama_perf_context_print:        eval time =  229612.62 ms /   539 runs   (  426.00 ms per token,     2.35 tokens per second)
llama_perf_context_print:       total time =  291927.41 ms /  1046 tokens
 89%|████████▉ | 16/18 [1:18:39<09:45, 292.86s/it]Llama.generate: 474 prefix-match hit, remaining 1446 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  179031.31 ms /  1446 tokens (  123.81 ms per token,     8.08 tokens per second)
llama_perf_context_print:        eval time =  157515.49 ms /   361 runs   (  436.33 ms per token,     2.29 tokens per second)
llama_perf_context_print:       total time =  336805.53 ms /  1807 tokens
 94%|█████████▍| 17/18 [1:24:15<05:06, 306.07s/it]Llama.generate: 474 prefix-match hit, remaining 870 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  101330.63 ms /   870 tokens (  116.47 ms per token,     8.59 tokens per second)
llama_perf_context_print:        eval time =  195365.56 ms /   462 runs   (  422.87 ms per token,     2.36 tokens per second)
llama_perf_context_print:       total time =  297069.31 ms /  1332 tokens
100%|██████████| 18/18 [1:29:13<00:00, 303.37s/it]Llama.generate: 474 prefix-match hit, remaining 870 prompt tokens to eval
llama_perf_context_print:        load time =  536836.20 ms
llama_perf_context_print: prompt eval time =  105586.83 ms /   870 tokens (  121.36 ms per token,     8.24 tokens per second)
llama_perf_context_print:        eval time =  148597.04 ms /   345 runs   (  430.72 ms per token,     2.32 tokens per second)
llama_perf_context_print:       total time =  254433.49 ms /  1215 tokens
100%|██████████| 18/18 [1:33:27<00:00, 311.53s/it]
Formatting the model output

Extract the JSON data

In [52]:
data_1['model_response_parsed'] = data_1['model_response'].apply(extract_json_data)
In [53]:
data_1['model_response_parsed']
Out[53]:
model_response_parsed
0 {'Positive Events': ['Roku Inc announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel.', 'FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system.', 'Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia.'], 'Negative Events': ['Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue.', 'Apple's profit warning led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple.', 'Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to significant losses for Berkshire Hathaway and a $21.4 billion decrease in its market value.']}
1 {'Positive Events': ['Sprint and Samsung planning 5G smartphone release in nine U.S. cities, expanding market reach', 'AMS developing new 3D facial recognition features for smartphones, reducing dependence on Apple', 'Deutsche Bank upgrades Universal Music Group valuation, identifying potential suitors'], 'Negative Events': ['AMS lowers revenue growth forecast due to weak demand from smartphone makers and automotive industry', 'Chinese smartphone market experiences decline in shipments, impacting sales for companies', 'European Commission launches investigation into Nike's tax treatment in the Netherlands']}
2 {'Positive Events': ['Dialog Semiconductor reported resilient fourth quarter revenue despite a decrease in iPhone sales at main customer Apple, leading to a 4% increase in the company's shares.', 'Verizon announced an expansion of its partnership with Apple Music, making it a built-in inclusion for certain data plans, deepening their partnership.', 'Netflix announced a price increase for U.S. subscribers and strong online sales during the holiday season, leading to gains in the technology and communication services sectors.'], 'Negative Events': ['Chinese trade data showed unexpected drops in exports and imports, leading to a halt in Europe's stock market rally and losses in technology and luxury goods sectors.', 'Apple faces a ban on some iPhones in China and a patent lawsuit loss in Germany, potentially impacting sales and brand reputation.', 'Foxconn, Apple's biggest iPhone assembler, let go around 50,000 contract workers earlier than usual, raising concerns about demand and supply chain disruptions.']}
3 {'Positive Events': ['IBM reported better-than-expected earnings and revenue, with its cloud computing business contributing positively.', 'Huawei launched a new smartphone, the Honor View20, with advanced camera features and a lower price point than rivals.', 'Mastercard is determined to apply for a bankcard clearing license in China again, following American Express' success.'], 'Negative Events': ['Swiss National Bank governor emphasized the need for negative interest rates and foreign currency market intervention to prevent deflation, potentially impacting investor sentiment.', 'White House rejected a scheduled meeting with Chinese officials due to disagreements over intellectual property rules, causing cautious trading in Asian stocks.', 'Texas Instruments reported missed revenue forecasts due to weak global smartphone sales, leading to a slight increase in share price during after-hours trading.']}
4 {'Positive Events': ['Apple reported stronger-than-expected earnings for Q1 2023, with GAAP EPS coming in at $4.18 versus the estimated $4.17 and revenue surpassing expectations.', 'CVS Health's insurer, Aetna, announced a new health app for Apple Watches, called Attain, which offers customized fitness challenges and rewards.', 'Corning reported higher-than-expected revenue and profit for Q4, driven by increased demand from telecom companies investing in 5G networks.'], 'Negative Events': ['Caterpillar reported lower-than-expected fourth quarter earnings and full year 2019 outlook due to weak demand in China's construction business, causing shares to fall and pulling down U.S. stock futures.', 'Apple is expected to report lower-than-projected fiscal first quarter earnings, with revenue falling significantly due to disappointing iPhone sales in China.', '3M issued a revenue warning due to weak demand in China, affecting its automotive and electronics businesses and reducing sales growth projections.']}
5 {'Positive Events': ['JPMorgan suggests Apple should acquire Netflix, potentially leading to long-term streaming and advertising revenue upside. Netflix stock rose 1.45% in early trading after the report.', 'Ultimate Software accepts a $331.50 per share takeover offer from a Hellman Friedman-led consortium, with a 50% premium to its previous closing price.', 'Apple's French division reaches an agreement to pay undeclared back taxes estimated at around 571 million euros, potentially improving its relationship with the French tax administration.'], 'Negative Events': ['AMS, a sensor specialist supplying components for Apple's face recognition technology, warns of a more than 20% sales decline in Q1 2019 due to weak smartphone demand.', 'Two U.S. House Democrats express concern over Apple's handling of a privacy flaw in its FaceTime group video chat software, potentially damaging its reputation.', 'Analysts predict a decline of 0.1% in first quarter earnings for S&P 500 companies, potentially indicating a challenging earnings season for many companies.']}
6 {'Positive Events': ['Apple reported stronger than projected earnings, driven by increased demand for its cybersecurity and media content delivery services.', 'Apple is targeting an April event to introduce a streaming television service, likely featuring content from CBS, Viacom, and Lions Gate, along with its own original productions.', 'NVIDIA's stock price surged in premarket trade following the company's forecast for better-than-expected sales during the current fiscal year.'], 'Negative Events': ['The EU General Court annulled an order for Belgian tax break schemes worth about 700 million euros to multinational firms, potentially impacting Apple and other companies.', 'Warren Buffett's Berkshire Hathaway reduced its stake in Apple, raising concerns about investor sentiment.', 'Apple significantly ramped up its self-driving car testing but still lags behind market leader Waymo.']}
7 {'Positive Events': ['Warner Bros adopted inclusion riders policy, leading to increase in films with female leads and positive industry response.', 'Garmin reported stronger-than-expected fourth quarter earnings and revenue, causing shares to surge.', 'Apple and Goldman Sachs partnered to launch co-branded credit cards, benefiting both companies'], 'Negative Events': ['WhatsApp acknowledged security bug allowing iPhone users to bypass privacy feature, raising concerns about user data protection.', 'Kraft Heinz suffered significant loss in premarket trade due to disappointing earnings report and SEC investigation.', 'Apple's vehicle project may shift from car development to electric van, causing uncertainty about the project's direction.']}
8 {'Positive Events': ['President Trump's announcement of progress in trade talks with China and delayed tariff hikes led to gains in trade-sensitive stocks like Boeing and Caterpillar.', 'Sony unveiled its new flagship Xperia 1 at Mobile World Congress, featuring a 21:9 ratio HDR OLED screen and professional-grade camera capabilities.', 'Warren Buffett's company, Berkshire Hathaway, defended its position as a global tech leader amid Huawei's ongoing U.S.-China trade tensions.'], 'Negative Events': ['AAC Technologies Holdings reported a significant decrease in expected net profit for Q1 2019 due to reduced orders from customers and weak seasonal demand, causing its shares to plummet.', 'Apple announced plans to lay off approximately 190 employees from its self-driving car project, Project Titan.', 'Huawei faced ongoing U.S.-China trade tensions, with the company defending its position as a global tech leader amid efforts to exclude it from 5G networks.']}
9 {'Positive Events': ['Spotify reported over 1 million unique users in India within a week of launch, expanding its market reach.', 'IBM CEO Ginni Rometty and Apple CEO Tim Cook attended a White House forum discussing hiring Americans without college degrees, indicating a positive shift in the job market.', 'Chinese online retailers discounted the price of Apple's iPhone XS, potentially boosting sales in China.'], 'Negative Events': ['Mozilla considers revoking DarkMatter's authority to certify websites as safe due to reports linking the cybersecurity firm to a UAE-based intelligence agency's hacking program.', 'European shares were flat due to weak results from the auto sector and fading investor confidence, negatively impacting companies like Schaeffler.', 'Tesla faced challenges with halted Model 3 sales in China due to regulatory issues and reported a first-quarter loss.']}
10 {'Positive Events': ['Apple launched a new television advertising campaign emphasizing its commitment to data privacy, potentially boosting consumer trust and differentiating it from rivals under regulatory scrutiny.', 'Stocks rallied on Friday following reports of progress in US-China trade talks, with tech companies leading the gains and the S&P 500 and Nasdaq posting their best weekly gains since November.', 'In a preliminary ruling, U.S. District Court Judge Gonzalo Curiel ordered Qualcomm to pay nearly $1 billion in patent royalty rebate payments to Apple, potentially strengthening Apple's financial position.'], 'Negative Events': ['Boeing's NYSE BA stock experienced significant losses in premarket trade after the FAA grounded its 737 Max 8 following a deadly crash, potentially impacting the company's reputation and sales.', 'Smartphone shipments to China dropped to their lowest level in six years due to consumer hesitation amid a slowing economy and trade tensions, negatively affecting companies like Apple that rely heavily on the Chinese market.', 'The European Union's Competition Commissioner Margrethe Vestager indicated that the EU is considering opening an investigation into Apple over allegations of using its app store to favor its own services over rivals, potentially leading to regulatory action and fines.']}
11 {'Positive Events': ['Apple introduced updated AirPods headphones with improved battery life and optional wireless charging case.', 'Samsung Electronics reported strong sales for its new Galaxy flagship smartphones in China.', 'Foxconn announced completion of new factory in Wisconsin by end of 2019 to manufacture liquid crystal display screens.'], 'Negative Events': ['Facebook's stock price dropped more than 3% following downgrades from Needham and Bank of America due to privacy concerns and regulatory risks.', 'Myer, Australia's largest department store chain, announced it will stop selling Apple products due to weak demand and unprofitable sales.', 'Tesla faces a lawsuit from a former engineer for allegedly copying Autopilot source code before joining Chinese startup.']}
12 {'Positive Events': ['Apple announced new subscription services, including Apple TV+, Apple Arcade, and AppleNews, which have the potential to generate significant revenue.', 'Goldman Sachs announced a credit card partnership with Apple, expanding its consumer business and tapping into iPhone users.', 'Lyft raised the price range for its IPO due to strong investor demand, indicating high confidence in the company's growth potential.'], 'Negative Events': ['The yield curve inversion and recession fears led to a decline in U.S. stock markets, negatively impacting Apple's stock price.', 'Qualcomm's bid to block imports of certain Apple iPhones was rejected by the ITC, potentially leading to financial losses.', 'Sony announced the closure of its Beijing smartphone plant, resulting in a significant financial loss for the company.']}
13 {'Positive Events': ['Apple and other consumer brands, including Louis Vuitton and Gucci, reduced prices for their products in China following a tax rate cut, potentially boosting sales.', 'Japan Display, a key Apple supplier, will begin providing OLED screens for the Apple Watch later this year, marking its entry into the OLED market.', 'S&P 500, Dow Jones Industrial Average, and Nasdaq Composite closed higher due to optimism over US-China trade talks and surging chip stocks.'], 'Negative Events': ['Apple's NASDAQ stock decreased due to price cuts in China, potentially impacting investor confidence.', 'Samsung Electronics faces challenges including falling memory chip prices and slowing demand for display panels, potentially leading to significant earnings miss.', 'Facebook's business version of WhatsApp was launched, but the company faces new regulations and potential fines in Australia for failing to remove violent content expeditiously.']}
14 {'Positive Events': ['Apple's initiative to reduce carbon footprint led to nearly doubling the number of suppliers using clean energy, including major iPhone manufacturers.', 'Delta Airlines' Q1 earnings surpassed expectations, leading to a 2.7% increase in DAL stock.', 'Oprah Winfrey and Prince Harry partnered to create an Apple documentary aimed at promoting mental health awareness.'], 'Negative Events': ['Mobile phone shipments to China dropped by 6 percent, marking the fifth consecutive month of decline and following a 15.5 percent decrease in 2018.', 'Google raised YouTube TV's monthly membership fee by 25%, potentially impacting subscriber base and revenue.', 'Apple is under investigation by the Dutch competition agency for allegedly favoring its own apps on the App Store.']}
15 {'Positive Events': ['Apple supplier Foxconn's chairman Terry Gou announced his intention to contest the 2020 presidential election, potentially shaking up Taiwan's political landscape.', 'TomTom reported a 14% increase in first quarter revenue to €169.5 million, beating analysts' forecasts and securing two contracts to supply high definition maps to major carmakers.', 'Qualcomm saw a surge in stock price due to a patent settlement with Apple and potential earnings boost from Huawei.'], 'Negative Events': ['Apple faced a securities fraud lawsuit for allegedly concealing weakened demand for iPhones, particularly in China, leading to a significant stock price drop.', 'Taiwan Semiconductor Manufacturing Company (TSMC) reported a steep quarterly profit drop of 32% due to weak global demand for smartphones and the prolonged U.S.-China trade war.', 'Samsung Electronics reported issues with the displays of its upcoming foldable smartphone, the Galaxy Fold, raising concerns over a smooth launch.']}
16 {'Positive Events': ['Tencent Holdings invested in Argentine mobile banking service Uala, significantly raising its valuation and helping accelerate growth plans.', 'Snap reported better-than-expected earnings for Q1, driven by the popularity of its original shows and the launch of a new Android app.', 'ASM International beat first quarter expectations with strong performance in its fabrication and logic semiconductor businesses.'], 'Negative Events': ['Taiwan's export orders continued to decline for the fifth consecutive month, falling at a faster-than-expected rate of 9%.', 'Samsung Electronics delayed the retail launch of its new Galaxy Fold smartphone due to display issues reported by reviewers.', 'LG Electronics announced it would cease smartphone production in South Korea and shift manufacturing to Vietnam due to global demand slump and low market share.']}
17 {'Positive Events': ['Spotify reported better-than-expected Q1 revenue growth, reaching 100 million paid subscribers.', 'S&P 500 reached a new record high close, fueled by strong earnings reports from companies like Apple.', 'Federal Reserve anticipated to keep interest rates unchanged, but a rate cut is expected later this year. Apple's earnings report exceeded expectations, leading to a post-market surge in shares.'], 'Negative Events': ['Czech Finance Ministry plans to impose a digital tax on global internet giants, potentially impacting their profits.', 'Disappointing earnings reports from Google parent Alphabet and Samsung.', 'European shares fell on Tuesday, with major indices dropping except London's FTSE 100. Danske Bank plunged due to regulatory issues.']}

Checking for empty dictionary

In [54]:
data_1[data_1["model_response_parsed"]=={}]
Out[54]:
Date News model_response model_response_parsed

Example model parsed response

In [55]:
data_1['model_response_parsed'][4]
Out[55]:
{'Positive Events': ['Apple reported stronger-than-expected earnings for Q1 2023, with GAAP EPS coming in at $4.18 versus the estimated $4.17 and revenue surpassing expectations.',
  "CVS Health's insurer, Aetna, announced a new health app for Apple Watches, called Attain, which offers customized fitness challenges and rewards.",
  'Corning reported higher-than-expected revenue and profit for Q4, driven by increased demand from telecom companies investing in 5G networks.'],
 'Negative Events': ["Caterpillar reported lower-than-expected fourth quarter earnings and full year 2019 outlook due to weak demand in China's construction business, causing shares to fall and pulling down U.S. stock futures.",
  'Apple is expected to report lower-than-projected fiscal first quarter earnings, with revenue falling significantly due to disappointing iPhone sales in China.',
  '3M issued a revenue warning due to weak demand in China, affecting its automotive and electronics businesses and reducing sales growth projections.']}

Create a dataframe from the JSON data

In [76]:
model_response_parsed_df = pd.json_normalize(data_1['model_response_parsed'])
model_response_parsed_df.head(2)
Out[76]:
Positive Events Negative Events
0 [Roku Inc announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel., FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system., Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia.] [Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue., Apple's profit warning led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple., Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to significant losses for Berkshire Hathaway and a $21.4 billion decrease in its market value.]
1 [Sprint and Samsung planning 5G smartphone release in nine U.S. cities, expanding market reach, AMS developing new 3D facial recognition features for smartphones, reducing dependence on Apple, Deutsche Bank upgrades Universal Music Group valuation, identifying potential suitors] [AMS lowers revenue growth forecast due to weak demand from smartphone makers and automotive industry, Chinese smartphone market experiences decline in shipments, impacting sales for companies, European Commission launches investigation into Nike's tax treatment in the Netherlands]
In [70]:
data_with_parsed_model_output = pd.concat([data_1, model_response_parsed_df], axis=1)

Remove square brackets from Positive and Negative Events

In [71]:
import re
def remove_brackets(text):
    if isinstance(text, list):  # Check if text is a list
        text = ', '.join(text) if text else ''  # Join list elements into a string
    return re.sub(r'\[.*?\]', '', str(text))  # Convert text to string before applying re.sub
In [72]:
data_with_parsed_model_output['Positive Events'] = data_with_parsed_model_output['Positive Events'].apply(remove_brackets)
data_with_parsed_model_output['Negative Events'] = data_with_parsed_model_output['Negative Events'].apply(remove_brackets)
In [73]:
data_with_parsed_model_output.head(2)
Out[73]:
Date News model_response model_response_parsed Positive Events Negative Events
0 2019-01-06 The tech sector experienced a significant decline in the aftermarket following Apple's Q1 revenue warning. Notable suppliers, including Skyworks, Broadcom, Lumentum, Qorvo, and TSMC, saw their stocks drop in response to Apple's downward revision of its revenue expectations for the quarter, previously announced in January. || Apple lowered its fiscal Q1 revenue guidance to $84 billion from earlier estimates of $89-$93 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple's stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 10 || Apple cut its fiscal first quarter revenue forecast from $89-$93 billion to $84 billion due to weaker demand in China and fewer iPhone upgrades. CEO Tim Cook also mentioned constrained sales of Airpods and Macbooks. Apple's shares fell 8.5% in post market trading, while Asian suppliers like Hon || This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple || Apple's revenue warning led to a decline in USD JPY pair and a gain in Japanese yen, as investors sought safety in the highly liquid currency. Apple's underperformance in Q1, with forecasted revenue of $84 billion compared to analyst expectations of $91.5 billion, triggered risk aversion mood in markets || Apple CEO Tim Cook discussed the company's Q1 warning on CNBC, attributing US-China trade tensions as a factor. Despite not mentioning iPhone unit sales specifically, Cook indicated Apple may comment on them again. Services revenue is projected to exceed $10.8 billion in Q1. Cook also addressed the lack of || Roku Inc has announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel. Partners include CBS Corp's Showtime, Lionsgate's Starz, and Viacom Inc's Noggin. This model follows Amazon's successful Channels business, which generated an estimated || Wall Street saw modest gains on Wednesday but were threatened by fears of a global economic slowdown following Apple's shocking revenue forecast cut, blaming weak demand in China. The tech giant's suppliers and S&P 500 futures also suffered losses. Reports of decelerating factory activity in China and the euro zone || Apple's fiscal first quarter revenue came in below analysts' estimates at around $84 billion, a significant drop from the forecasted range of $89-$93 billion. The tech giant attributed the shortfall to lower iPhone revenue and upgrades, as well as weakness in emerging markets. Several brokerages had already reduced their production estimates || Apple Inc. lowered its quarterly sales forecast for the fiscal first quarter, underperforming analysts' expectations due to slowing Chinese economy and trade tensions. The news sent Apple shares tumbling and affected Asia-listed suppliers like Hon Hai Precision Industry Co Ltd, Taiwan Semiconductor Manufacturing Company, and LG Innot || The Australian dollar experienced significant volatility on Thursday, plunging to multi-year lows against major currencies due to automated selling, liquidity issues, and a drought of trades. The largest intra-day falls in the Aussie's history occurred amid violent movements in AUD/JPY and AUD/ || In early Asian trading on Thursday, the Japanese yen surged as the U.S. dollar and Australian dollar collapsed in thin markets due to massive stop loss sales triggered by Apple's earnings warning of sluggish iPhone sales in China and risk aversion. The yen reached its lowest levels against the U.S. dollar since March || The dollar fell from above 109 to 106.67 after Apple's revenue warning, while the 10-year Treasury yield also dropped to 2.61%. This followed money flowing into US government paper. Apple's shares and U.S. stock index futures declined, with the NAS || RBC Capital maintains its bullish stance on Apple, keeping its Outperform rating and $220 price target. However, analyst Amit Daryanani warns of ongoing iPhone demand concerns, which could impact pricing power and segmentation efforts if severe. He suggests potential capital allocation adjustments if the stock underperforms for several quarters || Oil prices dropped on Thursday as investor sentiment remained affected by China's economic slowdown and turmoil in stock and currency markets. US WTI Crude Oil fell by $2.10 to $45.56 a barrel, while International Brent Oil was down $1.20 at $54.26 || In this news article, investors' concerns about a slowing Chinese and global economy, amplified by Apple's revenue warning, led to a significant surge in the Japanese yen. The yen reached its biggest one-day rise in 20 months, with gains of over 4% versus the dollar. This trend was driven by automated || In Asia, gold prices rose to over six-month highs on concerns of a global economic slowdown and stock market volatility. Apple lowered its revenue forecast for the first quarter, leading Asian stocks to decline and safe haven assets like gold and Japanese yen to gain. Data showed weakened factory activity in Asia, particularly China, adding to || Fears of a global economic slowdown led to a decline in the US dollar on Thursday, as the yen gained ground due to its status as a safe haven currency. The USD index slipped below 96, and USD JPY dropped to 107.61, while the yen strengthened by 4.4%. || In Thursday trading, long-term US Treasury yields dropped significantly below 2.6%, reaching levels not seen in over a year, as investors shifted funds from stocks to bonds following Apple's warning of decreased revenue due to emerging markets and China's impact on corporate profits, with the White House advisor adding to concerns of earnings down || Gold prices have reached their highest level since mid-June, with the yellow metal hitting $1,291.40 per ounce due to investor concerns over a slowing economy and Apple's bearish revenue outlook. Saxo Bank analyst Ole Hansen predicts gold may reach $1,300 sooner || Wedbush analyst Daniel Ives lowered his price target for Apple from $275 to $200 due to concerns over potential iPhone sales stagnation, with an estimated 750 million active iPhones worldwide that could cease growing or even decline. He maintains an Outperform rating and remains bullish on the long || Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia. The Organization of the Petroleum Exporting Countries (OPEC) led by Saudi Arabia and other producers || This news article reports on the impact of Apple's Q1 revenue warning on several tech and biotech stocks. Sesen Bio (SESN) and Prana Biotechnology (PRAN) saw their stock prices drop by 28% and 11%, respectively, following the announcement. Mellanox Technologies (ML || Gold prices reached within $5 of $1,300 on Thursday as weak stock markets and a slumping dollar drove investors towards safe-haven assets. The U.S. stock market fell about 2%, with Apple's rare profit warning adding to investor unease. COMEX gold futures settled at $1 || The FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system due to banks' ample capital. She also mentioned a review of the CAMELS rating system used to evaluate bank health for potential inconsistencies and concerns regarding forum shopping. This review comes from industry || Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue. This marks a significant downturn during Tim Cook's tenure and reflects broader economic concerns in China exacerbated by trade tensions with the US. U || Goldman analyst Rod Hall lowered his price target for Apple from $182 to $140, citing potential risks to the tech giant's 2019 numbers due to uncertainties in Chinese demand. He reduced his revenue estimate for the year by $6 billion and EPS forecast by $1.54 || Delta Air Lines lowered its fourth-quarter revenue growth forecast to a range of 3% from the previous estimate of 3% to 5%. Earnings per share are now expected to be $1.25 to $1.30. The slower pace of improvement in late December was unexpected, and Delta cited this as || Apple's profit warning has significantly impacted the stock market and changed the outlook for interest rates. The chance of a rate cut in May has increased to 15-16% from just 3%, according to Investing com's Fed Rate Monitor Tool. There is even a 1% chance of two cuts in May. || The White House advisor, Kevin Hassett, stated that a decline in Chinese economic growth would negatively impact U.S. firm profits but recover once a trade deal is reached between Washington and Beijing. He also noted that Asian economies, including China, have been experiencing significant slowdowns since last spring due to U.S. tariffs || The White House economic adviser, Kevin Hassett, warned that more companies could face earnings downgrades due to ongoing trade negotiations between the U.S. and China, leading to a decline in oil prices on Thursday. WTI crude fell 44 cents to $44.97 a barrel, while Brent crude inched || Japanese stocks suffered significant losses on the first trading day of 2019, with the Nikkei 225 and Topix indices both falling over 3 percent. Apple's revenue forecast cut, citing weak iPhone sales in China, triggered global growth concerns and sent technology shares tumbling. The S&P 50 || Investors withdrew a record $98 billion from U.S. stock funds in December, with fears of aggressive monetary policy and an economic slowdown driving risk reduction. The S&P 500 fell 9% last month, with some seeing declines as a buying opportunity. Apple's warning of weak iPhone sales added || Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple. This news, coupled with broad market declines, caused a significant $21.4 billion decrease in Berk || This news article reports that a cybersecurity researcher, Wish Wu, planned to present at the Black Hat Asia hacking conference on how to bypass Apple's Face ID biometric security on iPhones. However, his employer, Ant Financial, which operates Alipay and uses facial recognition technologies including Face ID, asked him to withdraw || OPEC's production cuts faced uncertainty as oil prices were influenced by volatile stock markets, specifically due to Apple's lowered revenue forecast and global economic slowdown fears. US WTI and Brent crude both saw gains, but these were checked by stock market declines. Shale production is expected to continue impacting the oil market in || Warren Buffett's Berkshire Hathaway suffered significant losses in the fourth quarter due to declines in Apple, its largest common stock investment. Apple cut its revenue forecast, causing a 5-6% decrease in Berkshire's Class A shares. The decline resulted in potential unrealized investment losses and could push Berk || This news article reports that on Thursday, the two-year Treasury note yield dropped below the Federal Reserve's effective rate for the first time since 2008. The market move suggests investors believe the Fed will not be able to continue tightening monetary policy. The drop in yields was attributed to a significant decline in U.S || The U.S. and China will hold their first face-to-face trade talks since agreeing to a 90-day truce in their trade war last month. Deputy U.S. Trade Representative Jeffrey Gerrish will lead the U.S. delegation for negotiations on Jan. 7 and 8, || Investors bought gold in large quantities due to concerns over a global economic slowdown, increased uncertainty in the stock market, and potential Fed rate hikes. The precious metal reached its highest price since June, with gold ETF holdings also seeing significant increases. Factors contributing to this demand include economic downturn, central bank policy mistakes, and || Delta Air Lines Inc reported lower-than-expected fourth quarter unit revenue growth, citing weaker than anticipated late bookings and increased competition. The carrier now expects total revenue per available seat mile to rise about 3 percent in the period, down from its earlier forecast of 3.5 percent growth. Fuel prices are also expected to || U.S. stocks experienced significant declines on Thursday as the S&P 500 dropped over 2%, the Dow Jones Industrial Average fell nearly 3%, and the Nasdaq Composite lost approximately 3% following a warning of weak revenue from Apple and indications of slowing U.S. factory activity, raising concerns || President Trump expressed optimism over potential trade talks with China, citing China's current economic weakness as a potential advantage for the US. This sentiment was echoed by recent reports of weakened demand for Apple iPhones in China, raising concerns about the overall health of the Chinese economy. The White House is expected to take a strong stance in || Qualcomm secured a court order in Germany banning the sale of some iPhone models due to patent infringement, leading Apple to potentially remove these devices from its stores. However, third-party resellers like Gravis continue selling the affected iPhones. This is the third major effort by Qualcomm to ban Apple's iPhones glob || Oil prices rose on Friday in Asia as China confirmed trade talks with the U.S., with WTI gaining 0.7% to $47.48 and Brent increasing 0.7% to $56.38 a barrel. The gains came after China's Commerce Ministry announced that deputy U.S. Trade || Gold prices surged past the psychologically significant level of $1,300 per ounce in Asia on Friday due to growing concerns over a potential global economic downturn. The rise in gold was attributed to weak PMI data from China and Apple's reduced quarterly sales forecast. Investors viewed gold as a safe haven asset amidst || In an internal memo, Huawei's Chen Lifang reprimanded two employees for sending a New Year greeting on the company's official Twitter account using an iPhone instead of a Huawei device. The incident caused damage to the brand and was described as a "blunder" in the memo. The mistake occurred due to || This news article reports on the positive impact of trade war talks between Beijing and Washington on European stock markets, specifically sectors sensitive to the trade war such as carmakers, industrials, mining companies, and banking. Stocks rallied with mining companies leading the gains due to copper price recovery. Bayer shares climbed despite a potential ruling restricting || Amazon has sold over 100 million devices with its Alexa digital assistant, according to The Verge. The company is cautious about releasing hardware sales figures and did not disclose holiday numbers for the Echo Dot. Over 150 products feature Alexa, and more than 28,000 smart home || The Supreme Court will review Broadcom's appeal in a shareholder lawsuit over the 2015 acquisition of Emulex. The case hinges on whether intent to defraud is required for such lawsuits, and the decision could extend beyond the Broadcom suit. An Emulex investor filed a class action lawsuit || The Chinese central bank announced a fifth reduction in the required reserve ratio (RRR) for banks, freeing up approximately 116.5 billion yuan for new lending. This follows mounting concerns about China's economic health amid slowing domestic demand and U.S. tariffs on exports. Premier Li Keqiang || The stock market rebounded strongly on Friday following positive news about US-China trade talks, a better-than-expected jobs report, and dovish comments from Federal Reserve Chairman Jerome Powell. The Dow Jones Industrial Average rose over 746 points, with the S&P 500 and Nasdaq Com {\n "Positive Events": [\n "Roku Inc announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel.",\n "FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system.",\n "Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia."\n ],\n "Negative Events": [\n "Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue.",\n "Apple's profit warning led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple.",\n "Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to significant losses for Berkshire Hathaway and a $21.4 billion decrease in its market value."\n ]\n }\n\n This JSON output summarizes the top three positive and negative events based on their significance and potential impact on Apple's stock price following the analysis of the provided news articles. The positive events include Roku's announcement of premium video channels, the FDIC Chair's reassuring statement about market volatility, and oil price rebound due to various factors. The negative events include Apple's historic revenue warning, Berkshire Hathaway's significant losses due to Apple's stock decline, and the substantial decrease in Berkshire Hathaway's market value as a result. {'Positive Events': ['Roku Inc announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel.', 'FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system.', 'Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia.'], 'Negative Events': ['Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue.', 'Apple's profit warning led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple.', 'Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to significant losses for Berkshire Hathaway and a $21.4 billion decrease in its market value.']} Roku Inc announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel., FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system., Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia. Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue., Apple's profit warning led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple., Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to significant losses for Berkshire Hathaway and a $21.4 billion decrease in its market value.
1 2019-01-13 Sprint and Samsung plan to release 5G smartphones in nine U.S. cities this summer, with Atlanta, Chicago, Dallas, Houston, Kansas City, Los Angeles, New York City, Phoenix, and Washington D.C. being the initial locations. Rival Verizon also announced similar plans for the first half of 20 || AMS, an Austrian tech company listed in Switzerland and a major supplier to Apple, has developed a light and infrared proximity sensor that can be placed behind a smartphone's screen. This allows for a larger display area by reducing the required space for sensors. AMS provides optical sensors for 3D facial recognition features on Apple || Deutsche Bank upgraded Vivendi's Universal Music Group valuation from €20 billion to €29 billion, surpassing the market cap of Vivendi at €28.3 billion. The bank anticipates music streaming revenue to reach €21 billion in 2023 and identifies potential suitors for || Amazon's stock is predicted to surge by over 20% by the end of this year, according to a new report from Pivotal Research. Senior analyst Brian Wieser initiated coverage on the stock with a buy rating and a year-end price target of $1,920. The growth potential for Amazon lies primarily in || AMS, an Austrian sensor specialist, is partnering with Chinese software maker Face to develop new 3D facial recognition features for smartphones. This move comes as AMS aims to reduce its dependence on Apple and boost its battered shares. AMS provides optical sensors for Apple's 3D facial recognition feature on iPhones, || Geely, China's most successful carmaker, forecasts flat sales for 2019 due to economic slowdown and cautious consumers. In 2018, it posted a 20% sales growth, but missed its target of 1.58 million cars by around 5%. Sales dropped 44 || China is making sincere efforts to address U.S. concerns and resolve the ongoing trade war, including lowering taxes on automobile imports and implementing a law banning forced technology transfers. However, Beijing cannot and should not dismantle its governance model as some in Trump's administration have demanded. Former Goldman Sachs China || Stock index futures indicate a slightly lower open on Wall Street Monday, as investors remain cautious amid lack of progress in U.S.-China trade talks and political risks from the ongoing government shutdown. Dow futures were flat, S&P 500 dipped 0.13%, while Nasdaq 10 || Qualcomm, a leading chipmaker, has announced an expansion of its lineup of car computing chips into three tiers - entry-level, Performance, Premiere, and Paramount. This move is aimed at catering to various price points in the automotive market, similar to its smartphone offerings. The company has reported a backlog || The stock market showed minimal changes at the open as investors await trade talks progress between the U.S. and China. The S&P 500 dropped 0.04%, Dow lost 0.23%, but Nasdaq gained 0.2%. The ISM services index, expected to be released at 1 || The article suggests that some economists believe the US economy may have reached its peak growth rate, making the euro a potentially bullish investment. The EUR/USD exchange rate has held steady despite weak Eurozone data due to dollar weakness and stagnant interest rate expectations in Europe. However, concerns over economic growth are emerging due to sell || The Chinese smartphone market, the world's largest, saw a decline of 12-15.5 percent in shipments last year with December experiencing a 17 percent slump, according to China Academy of Information and Communications Technology (CAICT) and market research firm Canalys. This follows a 4 percent drop in ship || Austrian tech firm AT S lowered its revenue growth forecast for 2018/19 due to weak demand from smartphone makers and the automotive industry. The company now anticipates a 3% increase in sales from last year's €991.8 million, down from its previous projection of a 6- || The stock markets in Asia surged during morning trade on Wednesday, following reports of progress in U. S - China trade talks. Negotiators extended talks for a third day and reportedly made strides on purchases of U. S goods and services. However, structural issues such as intellectual property rights remain unresolved. President Trump is eager to strike || Mercedes Benz sold over 2.31 million passenger cars in 2018, making it the top selling premium automotive brand for the third year in a row. However, analysts question how long German manufacturers can dominate the luxury car industry due to the shift towards electric and self-driving cars. Tesla, with || The S&P 500 reached a three-week high on Tuesday, driven by gains in Apple, Amazon, Facebook, and industrial shares. Investors are optimistic about a potential deal between the US and China to end their trade war. The S&P 500 has rallied over 9% since late December || The stock market continued its rally on Tuesday, with the Dow Jones Industrial Average, S&P 500, and Nasdaq Composite all posting gains. Optimism over progress in trade talks between the US and China was a major contributor to the market's positive sentiment, with reports suggesting that both parties are narrowing their differences ahead || Roku's stock dropped by 5% on Tuesday following Citron Research's reversal of its long position, labeling the company as uninvestable. This change in stance came after Apple announced partnerships with Samsung to offer iTunes services on some Samsung TVs, potentially impacting Roku's user base growth. || The Chinese authorities are expected to release a statement following the conclusion of U. S. trade talks in Beijing, with both sides signaling progress toward resolving the conflict that has roiled markets. Chinese Vice Premier Liu He, who is also the chief economic adviser to Chinese President Xi Jinping, made an appearance at the negotiations and is || Xiaomi Co-founder Lei Jun remains optimistic about the future of his smartphone company despite a recent share slump that erased $6 billion in market value. The Chinese tech firm is shifting its focus to the high end and expanding into Europe, while shunning the US market. Xiaomi aims to elevate its Red || The European Commission has launched an investigation into Nike's tax treatment in the Netherlands, expressing concerns that the company may have received an unfair advantage through royalty payment structures. The EU executive has previously probed tax schemes in Belgium, Gibraltar, Luxembourg, Ireland, and the Netherlands, with countries ordered to recover taxes from benefici || Taiwan's Foxconn, a major Apple supplier, reported an 8.3% decline in December revenue to TWD 619.3 billion ($20.1 billion), marking its first monthly revenue dip since February. The fall was due to weak demand for consumer electronics. In 2018, Foxconn || Starting tomorrow, JD.com will offer reduced prices on some Apple iPhone 8 and 8 Plus models by approximately 600 yuan and 800 yuan respectively. These price drops, amounting to a savings of around $92-$130 per unit, are in line with earlier rumors suggesting price redu || Cummins, Inc. (CMI) announced that Pat Ward, its long-term Chief Financial Officer (CFO), will retire after 31 years of service on March 31, 2019. Mark Smith, who has served as Vice President of Financial Operations since 2014, will succeed Ward, || The Federal Reserve Chairman, Jerome Powell, maintained his patient stance on monetary policy but raised concerns about the balance sheet reduction. He indicated that the Fed's balance sheet would be substantially smaller, indicating the continuation of the balance sheet wind down operation. Despite this, Powell reassured investors of a slower pace on interest rate h || Wall Street experienced a decline after the opening bell on Friday, following five consecutive days of gains. The S&P 500 dropped 13 points or 0.54%, with the Dow decreasing 128 points or 0.54%, and the Nasdaq Composite losing 37 points or || Several Chinese retailers, including Alibaba-backed Suning and JD.com, have drastically reduced iPhone prices due to weak sales in China, which prompted Apple's recent revenue warning. Discounts for the latest XR model range from 800 to 1,200 yuan. These price || Green Dot, GDOT, is a bank holding company with a wide distribution network and impressive growth. Its product offerings include bank accounts, debit and credit cards, with a focus on perks. The firm's platform business, "banking as a service," powers offerings for partners such as Apple Pay Cash, Walmart Money || US stock futures declined on Friday as disappointing holiday sales and revenue cuts from various companies raised concerns about a potential recession. The S&P 500, Dow Jones Industrial Average, and Nasdaq 100 fell, with the Fed's possible policy pause and optimistic trade talks failing to offset these negative factors. || Apple's NASDAQ AAPL stock declined by 0.52% in premarket trade Friday due to price cuts of iPhone models in China, but the company is set to launch three new iPhone models this year. Johnson & Johnson's NYSE JNJ stock edged forward after raising prescription drug prices. Starbucks || Apple is reportedly set to release three new iPhone models this year, featuring new camera setups including triple rear cameras for the premium model and dual cameras for the others. The move comes after weak sales, particularly in China, led retailers to cut prices on the XR model. Amid sluggish sales, Apple opted to stick with {\n "Positive Events": [\n "Sprint and Samsung planning 5G smartphone release in nine U.S. cities, expanding market reach",\n "AMS developing new 3D facial recognition features for smartphones, reducing dependence on Apple",\n "Deutsche Bank upgrades Universal Music Group valuation, identifying potential suitors"\n ],\n "Negative Events": [\n "AMS lowers revenue growth forecast due to weak demand from smartphone makers and automotive industry",\n "Chinese smartphone market experiences decline in shipments, impacting sales for companies",\n "European Commission launches investigation into Nike's tax treatment in the Netherlands"\n ]\n }\n\n Explanation:\n\n Positive Events:\n 1. Sprint and Samsung's 5G smartphone release in nine U.S. cities expands their market reach and could potentially attract more customers, positively impacting their stock prices.\n 2. AMS's development of new 3D facial recognition features for smartphones reduces its dependence on Apple as a major customer and could lead to new business opportunities, potentially boosting its stock price.\n 3. Deutsche Bank's upgrade of Universal Music Group's valuation and identification of potential suitors indicates strong financial performance and potential for mergers or acquisitions, positively impacting Vivendi's stock price.\n\n Negative Events:\n 1. AMS's lowering of its revenue growth forecast due to weak demand from smartphone makers and the automotive industry indicates potential financial struggles and could negatively impact its stock price.\n 2. The Chinese smartphone market's decline in shipments indicates weak sales for companies in the industry, potentially negatively impacting their stock prices.\n 3. The European Commission's investigation into Nike's tax treatment in the Netherlands raises concerns about potential legal issues and could negatively impact its stock price. {'Positive Events': ['Sprint and Samsung planning 5G smartphone release in nine U.S. cities, expanding market reach', 'AMS developing new 3D facial recognition features for smartphones, reducing dependence on Apple', 'Deutsche Bank upgrades Universal Music Group valuation, identifying potential suitors'], 'Negative Events': ['AMS lowers revenue growth forecast due to weak demand from smartphone makers and automotive industry', 'Chinese smartphone market experiences decline in shipments, impacting sales for companies', 'European Commission launches investigation into Nike's tax treatment in the Netherlands']} Sprint and Samsung planning 5G smartphone release in nine U.S. cities, expanding market reach, AMS developing new 3D facial recognition features for smartphones, reducing dependence on Apple, Deutsche Bank upgrades Universal Music Group valuation, identifying potential suitors AMS lowers revenue growth forecast due to weak demand from smartphone makers and automotive industry, Chinese smartphone market experiences decline in shipments, impacting sales for companies, European Commission launches investigation into Nike's tax treatment in the Netherlands
In [74]:
#remove model_response and model_response_parsed
final_summary = data_with_parsed_model_output.drop(['model_response', 'model_response_parsed'], axis=1)

Display the Final Summary[ Date, News, Positive Events and Negative Events summary]

In [77]:
final_summary.head(2)
Out[77]:
Date News Positive Events Negative Events
0 2019-01-06 The tech sector experienced a significant decline in the aftermarket following Apple's Q1 revenue warning. Notable suppliers, including Skyworks, Broadcom, Lumentum, Qorvo, and TSMC, saw their stocks drop in response to Apple's downward revision of its revenue expectations for the quarter, previously announced in January. || Apple lowered its fiscal Q1 revenue guidance to $84 billion from earlier estimates of $89-$93 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple's stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 10 || Apple cut its fiscal first quarter revenue forecast from $89-$93 billion to $84 billion due to weaker demand in China and fewer iPhone upgrades. CEO Tim Cook also mentioned constrained sales of Airpods and Macbooks. Apple's shares fell 8.5% in post market trading, while Asian suppliers like Hon || This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple || Apple's revenue warning led to a decline in USD JPY pair and a gain in Japanese yen, as investors sought safety in the highly liquid currency. Apple's underperformance in Q1, with forecasted revenue of $84 billion compared to analyst expectations of $91.5 billion, triggered risk aversion mood in markets || Apple CEO Tim Cook discussed the company's Q1 warning on CNBC, attributing US-China trade tensions as a factor. Despite not mentioning iPhone unit sales specifically, Cook indicated Apple may comment on them again. Services revenue is projected to exceed $10.8 billion in Q1. Cook also addressed the lack of || Roku Inc has announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel. Partners include CBS Corp's Showtime, Lionsgate's Starz, and Viacom Inc's Noggin. This model follows Amazon's successful Channels business, which generated an estimated || Wall Street saw modest gains on Wednesday but were threatened by fears of a global economic slowdown following Apple's shocking revenue forecast cut, blaming weak demand in China. The tech giant's suppliers and S&P 500 futures also suffered losses. Reports of decelerating factory activity in China and the euro zone || Apple's fiscal first quarter revenue came in below analysts' estimates at around $84 billion, a significant drop from the forecasted range of $89-$93 billion. The tech giant attributed the shortfall to lower iPhone revenue and upgrades, as well as weakness in emerging markets. Several brokerages had already reduced their production estimates || Apple Inc. lowered its quarterly sales forecast for the fiscal first quarter, underperforming analysts' expectations due to slowing Chinese economy and trade tensions. The news sent Apple shares tumbling and affected Asia-listed suppliers like Hon Hai Precision Industry Co Ltd, Taiwan Semiconductor Manufacturing Company, and LG Innot || The Australian dollar experienced significant volatility on Thursday, plunging to multi-year lows against major currencies due to automated selling, liquidity issues, and a drought of trades. The largest intra-day falls in the Aussie's history occurred amid violent movements in AUD/JPY and AUD/ || In early Asian trading on Thursday, the Japanese yen surged as the U.S. dollar and Australian dollar collapsed in thin markets due to massive stop loss sales triggered by Apple's earnings warning of sluggish iPhone sales in China and risk aversion. The yen reached its lowest levels against the U.S. dollar since March || The dollar fell from above 109 to 106.67 after Apple's revenue warning, while the 10-year Treasury yield also dropped to 2.61%. This followed money flowing into US government paper. Apple's shares and U.S. stock index futures declined, with the NAS || RBC Capital maintains its bullish stance on Apple, keeping its Outperform rating and $220 price target. However, analyst Amit Daryanani warns of ongoing iPhone demand concerns, which could impact pricing power and segmentation efforts if severe. He suggests potential capital allocation adjustments if the stock underperforms for several quarters || Oil prices dropped on Thursday as investor sentiment remained affected by China's economic slowdown and turmoil in stock and currency markets. US WTI Crude Oil fell by $2.10 to $45.56 a barrel, while International Brent Oil was down $1.20 at $54.26 || In this news article, investors' concerns about a slowing Chinese and global economy, amplified by Apple's revenue warning, led to a significant surge in the Japanese yen. The yen reached its biggest one-day rise in 20 months, with gains of over 4% versus the dollar. This trend was driven by automated || In Asia, gold prices rose to over six-month highs on concerns of a global economic slowdown and stock market volatility. Apple lowered its revenue forecast for the first quarter, leading Asian stocks to decline and safe haven assets like gold and Japanese yen to gain. Data showed weakened factory activity in Asia, particularly China, adding to || Fears of a global economic slowdown led to a decline in the US dollar on Thursday, as the yen gained ground due to its status as a safe haven currency. The USD index slipped below 96, and USD JPY dropped to 107.61, while the yen strengthened by 4.4%. || In Thursday trading, long-term US Treasury yields dropped significantly below 2.6%, reaching levels not seen in over a year, as investors shifted funds from stocks to bonds following Apple's warning of decreased revenue due to emerging markets and China's impact on corporate profits, with the White House advisor adding to concerns of earnings down || Gold prices have reached their highest level since mid-June, with the yellow metal hitting $1,291.40 per ounce due to investor concerns over a slowing economy and Apple's bearish revenue outlook. Saxo Bank analyst Ole Hansen predicts gold may reach $1,300 sooner || Wedbush analyst Daniel Ives lowered his price target for Apple from $275 to $200 due to concerns over potential iPhone sales stagnation, with an estimated 750 million active iPhones worldwide that could cease growing or even decline. He maintains an Outperform rating and remains bullish on the long || Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia. The Organization of the Petroleum Exporting Countries (OPEC) led by Saudi Arabia and other producers || This news article reports on the impact of Apple's Q1 revenue warning on several tech and biotech stocks. Sesen Bio (SESN) and Prana Biotechnology (PRAN) saw their stock prices drop by 28% and 11%, respectively, following the announcement. Mellanox Technologies (ML || Gold prices reached within $5 of $1,300 on Thursday as weak stock markets and a slumping dollar drove investors towards safe-haven assets. The U.S. stock market fell about 2%, with Apple's rare profit warning adding to investor unease. COMEX gold futures settled at $1 || The FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system due to banks' ample capital. She also mentioned a review of the CAMELS rating system used to evaluate bank health for potential inconsistencies and concerns regarding forum shopping. This review comes from industry || Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue. This marks a significant downturn during Tim Cook's tenure and reflects broader economic concerns in China exacerbated by trade tensions with the US. U || Goldman analyst Rod Hall lowered his price target for Apple from $182 to $140, citing potential risks to the tech giant's 2019 numbers due to uncertainties in Chinese demand. He reduced his revenue estimate for the year by $6 billion and EPS forecast by $1.54 || Delta Air Lines lowered its fourth-quarter revenue growth forecast to a range of 3% from the previous estimate of 3% to 5%. Earnings per share are now expected to be $1.25 to $1.30. The slower pace of improvement in late December was unexpected, and Delta cited this as || Apple's profit warning has significantly impacted the stock market and changed the outlook for interest rates. The chance of a rate cut in May has increased to 15-16% from just 3%, according to Investing com's Fed Rate Monitor Tool. There is even a 1% chance of two cuts in May. || The White House advisor, Kevin Hassett, stated that a decline in Chinese economic growth would negatively impact U.S. firm profits but recover once a trade deal is reached between Washington and Beijing. He also noted that Asian economies, including China, have been experiencing significant slowdowns since last spring due to U.S. tariffs || The White House economic adviser, Kevin Hassett, warned that more companies could face earnings downgrades due to ongoing trade negotiations between the U.S. and China, leading to a decline in oil prices on Thursday. WTI crude fell 44 cents to $44.97 a barrel, while Brent crude inched || Japanese stocks suffered significant losses on the first trading day of 2019, with the Nikkei 225 and Topix indices both falling over 3 percent. Apple's revenue forecast cut, citing weak iPhone sales in China, triggered global growth concerns and sent technology shares tumbling. The S&P 50 || Investors withdrew a record $98 billion from U.S. stock funds in December, with fears of aggressive monetary policy and an economic slowdown driving risk reduction. The S&P 500 fell 9% last month, with some seeing declines as a buying opportunity. Apple's warning of weak iPhone sales added || Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple. This news, coupled with broad market declines, caused a significant $21.4 billion decrease in Berk || This news article reports that a cybersecurity researcher, Wish Wu, planned to present at the Black Hat Asia hacking conference on how to bypass Apple's Face ID biometric security on iPhones. However, his employer, Ant Financial, which operates Alipay and uses facial recognition technologies including Face ID, asked him to withdraw || OPEC's production cuts faced uncertainty as oil prices were influenced by volatile stock markets, specifically due to Apple's lowered revenue forecast and global economic slowdown fears. US WTI and Brent crude both saw gains, but these were checked by stock market declines. Shale production is expected to continue impacting the oil market in || Warren Buffett's Berkshire Hathaway suffered significant losses in the fourth quarter due to declines in Apple, its largest common stock investment. Apple cut its revenue forecast, causing a 5-6% decrease in Berkshire's Class A shares. The decline resulted in potential unrealized investment losses and could push Berk || This news article reports that on Thursday, the two-year Treasury note yield dropped below the Federal Reserve's effective rate for the first time since 2008. The market move suggests investors believe the Fed will not be able to continue tightening monetary policy. The drop in yields was attributed to a significant decline in U.S || The U.S. and China will hold their first face-to-face trade talks since agreeing to a 90-day truce in their trade war last month. Deputy U.S. Trade Representative Jeffrey Gerrish will lead the U.S. delegation for negotiations on Jan. 7 and 8, || Investors bought gold in large quantities due to concerns over a global economic slowdown, increased uncertainty in the stock market, and potential Fed rate hikes. The precious metal reached its highest price since June, with gold ETF holdings also seeing significant increases. Factors contributing to this demand include economic downturn, central bank policy mistakes, and || Delta Air Lines Inc reported lower-than-expected fourth quarter unit revenue growth, citing weaker than anticipated late bookings and increased competition. The carrier now expects total revenue per available seat mile to rise about 3 percent in the period, down from its earlier forecast of 3.5 percent growth. Fuel prices are also expected to || U.S. stocks experienced significant declines on Thursday as the S&P 500 dropped over 2%, the Dow Jones Industrial Average fell nearly 3%, and the Nasdaq Composite lost approximately 3% following a warning of weak revenue from Apple and indications of slowing U.S. factory activity, raising concerns || President Trump expressed optimism over potential trade talks with China, citing China's current economic weakness as a potential advantage for the US. This sentiment was echoed by recent reports of weakened demand for Apple iPhones in China, raising concerns about the overall health of the Chinese economy. The White House is expected to take a strong stance in || Qualcomm secured a court order in Germany banning the sale of some iPhone models due to patent infringement, leading Apple to potentially remove these devices from its stores. However, third-party resellers like Gravis continue selling the affected iPhones. This is the third major effort by Qualcomm to ban Apple's iPhones glob || Oil prices rose on Friday in Asia as China confirmed trade talks with the U.S., with WTI gaining 0.7% to $47.48 and Brent increasing 0.7% to $56.38 a barrel. The gains came after China's Commerce Ministry announced that deputy U.S. Trade || Gold prices surged past the psychologically significant level of $1,300 per ounce in Asia on Friday due to growing concerns over a potential global economic downturn. The rise in gold was attributed to weak PMI data from China and Apple's reduced quarterly sales forecast. Investors viewed gold as a safe haven asset amidst || In an internal memo, Huawei's Chen Lifang reprimanded two employees for sending a New Year greeting on the company's official Twitter account using an iPhone instead of a Huawei device. The incident caused damage to the brand and was described as a "blunder" in the memo. The mistake occurred due to || This news article reports on the positive impact of trade war talks between Beijing and Washington on European stock markets, specifically sectors sensitive to the trade war such as carmakers, industrials, mining companies, and banking. Stocks rallied with mining companies leading the gains due to copper price recovery. Bayer shares climbed despite a potential ruling restricting || Amazon has sold over 100 million devices with its Alexa digital assistant, according to The Verge. The company is cautious about releasing hardware sales figures and did not disclose holiday numbers for the Echo Dot. Over 150 products feature Alexa, and more than 28,000 smart home || The Supreme Court will review Broadcom's appeal in a shareholder lawsuit over the 2015 acquisition of Emulex. The case hinges on whether intent to defraud is required for such lawsuits, and the decision could extend beyond the Broadcom suit. An Emulex investor filed a class action lawsuit || The Chinese central bank announced a fifth reduction in the required reserve ratio (RRR) for banks, freeing up approximately 116.5 billion yuan for new lending. This follows mounting concerns about China's economic health amid slowing domestic demand and U.S. tariffs on exports. Premier Li Keqiang || The stock market rebounded strongly on Friday following positive news about US-China trade talks, a better-than-expected jobs report, and dovish comments from Federal Reserve Chairman Jerome Powell. The Dow Jones Industrial Average rose over 746 points, with the S&P 500 and Nasdaq Com Roku Inc announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel., FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system., Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia. Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple's revenue., Apple's profit warning led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple., Apple's Q1 revenue guidance cut, resulting from weaker demand in China, led to significant losses for Berkshire Hathaway and a $21.4 billion decrease in its market value.
1 2019-01-13 Sprint and Samsung plan to release 5G smartphones in nine U.S. cities this summer, with Atlanta, Chicago, Dallas, Houston, Kansas City, Los Angeles, New York City, Phoenix, and Washington D.C. being the initial locations. Rival Verizon also announced similar plans for the first half of 20 || AMS, an Austrian tech company listed in Switzerland and a major supplier to Apple, has developed a light and infrared proximity sensor that can be placed behind a smartphone's screen. This allows for a larger display area by reducing the required space for sensors. AMS provides optical sensors for 3D facial recognition features on Apple || Deutsche Bank upgraded Vivendi's Universal Music Group valuation from €20 billion to €29 billion, surpassing the market cap of Vivendi at €28.3 billion. The bank anticipates music streaming revenue to reach €21 billion in 2023 and identifies potential suitors for || Amazon's stock is predicted to surge by over 20% by the end of this year, according to a new report from Pivotal Research. Senior analyst Brian Wieser initiated coverage on the stock with a buy rating and a year-end price target of $1,920. The growth potential for Amazon lies primarily in || AMS, an Austrian sensor specialist, is partnering with Chinese software maker Face to develop new 3D facial recognition features for smartphones. This move comes as AMS aims to reduce its dependence on Apple and boost its battered shares. AMS provides optical sensors for Apple's 3D facial recognition feature on iPhones, || Geely, China's most successful carmaker, forecasts flat sales for 2019 due to economic slowdown and cautious consumers. In 2018, it posted a 20% sales growth, but missed its target of 1.58 million cars by around 5%. Sales dropped 44 || China is making sincere efforts to address U.S. concerns and resolve the ongoing trade war, including lowering taxes on automobile imports and implementing a law banning forced technology transfers. However, Beijing cannot and should not dismantle its governance model as some in Trump's administration have demanded. Former Goldman Sachs China || Stock index futures indicate a slightly lower open on Wall Street Monday, as investors remain cautious amid lack of progress in U.S.-China trade talks and political risks from the ongoing government shutdown. Dow futures were flat, S&P 500 dipped 0.13%, while Nasdaq 10 || Qualcomm, a leading chipmaker, has announced an expansion of its lineup of car computing chips into three tiers - entry-level, Performance, Premiere, and Paramount. This move is aimed at catering to various price points in the automotive market, similar to its smartphone offerings. The company has reported a backlog || The stock market showed minimal changes at the open as investors await trade talks progress between the U.S. and China. The S&P 500 dropped 0.04%, Dow lost 0.23%, but Nasdaq gained 0.2%. The ISM services index, expected to be released at 1 || The article suggests that some economists believe the US economy may have reached its peak growth rate, making the euro a potentially bullish investment. The EUR/USD exchange rate has held steady despite weak Eurozone data due to dollar weakness and stagnant interest rate expectations in Europe. However, concerns over economic growth are emerging due to sell || The Chinese smartphone market, the world's largest, saw a decline of 12-15.5 percent in shipments last year with December experiencing a 17 percent slump, according to China Academy of Information and Communications Technology (CAICT) and market research firm Canalys. This follows a 4 percent drop in ship || Austrian tech firm AT S lowered its revenue growth forecast for 2018/19 due to weak demand from smartphone makers and the automotive industry. The company now anticipates a 3% increase in sales from last year's €991.8 million, down from its previous projection of a 6- || The stock markets in Asia surged during morning trade on Wednesday, following reports of progress in U. S - China trade talks. Negotiators extended talks for a third day and reportedly made strides on purchases of U. S goods and services. However, structural issues such as intellectual property rights remain unresolved. President Trump is eager to strike || Mercedes Benz sold over 2.31 million passenger cars in 2018, making it the top selling premium automotive brand for the third year in a row. However, analysts question how long German manufacturers can dominate the luxury car industry due to the shift towards electric and self-driving cars. Tesla, with || The S&P 500 reached a three-week high on Tuesday, driven by gains in Apple, Amazon, Facebook, and industrial shares. Investors are optimistic about a potential deal between the US and China to end their trade war. The S&P 500 has rallied over 9% since late December || The stock market continued its rally on Tuesday, with the Dow Jones Industrial Average, S&P 500, and Nasdaq Composite all posting gains. Optimism over progress in trade talks between the US and China was a major contributor to the market's positive sentiment, with reports suggesting that both parties are narrowing their differences ahead || Roku's stock dropped by 5% on Tuesday following Citron Research's reversal of its long position, labeling the company as uninvestable. This change in stance came after Apple announced partnerships with Samsung to offer iTunes services on some Samsung TVs, potentially impacting Roku's user base growth. || The Chinese authorities are expected to release a statement following the conclusion of U. S. trade talks in Beijing, with both sides signaling progress toward resolving the conflict that has roiled markets. Chinese Vice Premier Liu He, who is also the chief economic adviser to Chinese President Xi Jinping, made an appearance at the negotiations and is || Xiaomi Co-founder Lei Jun remains optimistic about the future of his smartphone company despite a recent share slump that erased $6 billion in market value. The Chinese tech firm is shifting its focus to the high end and expanding into Europe, while shunning the US market. Xiaomi aims to elevate its Red || The European Commission has launched an investigation into Nike's tax treatment in the Netherlands, expressing concerns that the company may have received an unfair advantage through royalty payment structures. The EU executive has previously probed tax schemes in Belgium, Gibraltar, Luxembourg, Ireland, and the Netherlands, with countries ordered to recover taxes from benefici || Taiwan's Foxconn, a major Apple supplier, reported an 8.3% decline in December revenue to TWD 619.3 billion ($20.1 billion), marking its first monthly revenue dip since February. The fall was due to weak demand for consumer electronics. In 2018, Foxconn || Starting tomorrow, JD.com will offer reduced prices on some Apple iPhone 8 and 8 Plus models by approximately 600 yuan and 800 yuan respectively. These price drops, amounting to a savings of around $92-$130 per unit, are in line with earlier rumors suggesting price redu || Cummins, Inc. (CMI) announced that Pat Ward, its long-term Chief Financial Officer (CFO), will retire after 31 years of service on March 31, 2019. Mark Smith, who has served as Vice President of Financial Operations since 2014, will succeed Ward, || The Federal Reserve Chairman, Jerome Powell, maintained his patient stance on monetary policy but raised concerns about the balance sheet reduction. He indicated that the Fed's balance sheet would be substantially smaller, indicating the continuation of the balance sheet wind down operation. Despite this, Powell reassured investors of a slower pace on interest rate h || Wall Street experienced a decline after the opening bell on Friday, following five consecutive days of gains. The S&P 500 dropped 13 points or 0.54%, with the Dow decreasing 128 points or 0.54%, and the Nasdaq Composite losing 37 points or || Several Chinese retailers, including Alibaba-backed Suning and JD.com, have drastically reduced iPhone prices due to weak sales in China, which prompted Apple's recent revenue warning. Discounts for the latest XR model range from 800 to 1,200 yuan. These price || Green Dot, GDOT, is a bank holding company with a wide distribution network and impressive growth. Its product offerings include bank accounts, debit and credit cards, with a focus on perks. The firm's platform business, "banking as a service," powers offerings for partners such as Apple Pay Cash, Walmart Money || US stock futures declined on Friday as disappointing holiday sales and revenue cuts from various companies raised concerns about a potential recession. The S&P 500, Dow Jones Industrial Average, and Nasdaq 100 fell, with the Fed's possible policy pause and optimistic trade talks failing to offset these negative factors. || Apple's NASDAQ AAPL stock declined by 0.52% in premarket trade Friday due to price cuts of iPhone models in China, but the company is set to launch three new iPhone models this year. Johnson & Johnson's NYSE JNJ stock edged forward after raising prescription drug prices. Starbucks || Apple is reportedly set to release three new iPhone models this year, featuring new camera setups including triple rear cameras for the premium model and dual cameras for the others. The move comes after weak sales, particularly in China, led retailers to cut prices on the XR model. Amid sluggish sales, Apple opted to stick with Sprint and Samsung planning 5G smartphone release in nine U.S. cities, expanding market reach, AMS developing new 3D facial recognition features for smartphones, reducing dependence on Apple, Deutsche Bank upgrades Universal Music Group valuation, identifying potential suitors AMS lowers revenue growth forecast due to weak demand from smartphone makers and automotive industry, Chinese smartphone market experiences decline in shipments, impacting sales for companies, European Commission launches investigation into Nike's tax treatment in the Netherlands

Conclusions and Recommendations

Conclusion and Business Recommendation on Sentimental Analysis:

  • We see that the dataset is imbalanced dataset. We could improve on the data by applying undersampling or oversampling of data, collect more data with negative and positive sentiments to take care of the imbalance issue.
  • As per EDA, The stock price trend
    • begins to rise up from begining of the month, peaks before the mid of the month, the drops low by mid of the month.It stays low from mid of the month to the beginning of the next month and the cycle continues.
    • The above cycle is true for all stock prices - Open, High, Low and Close price.
  • We see that the Volume traded drops drastically from Month 1 to Month 2 and then remains low. We recommend the business to look into the reasons as why the stock volume decretase.
  • We built ML models - Random Forest and XGBoost Models . We were only able to get a recall of 46%. We recommend to look at other model and pre-trained models to increase the recall.
  • By predicting the sentiments, we will be able to predict how the stock price is going up or down.

Conclusion and Business Recommendation on weekly news summarization:

  • Use other pretrained models and compare the model outputs
  • It takes Lot of computacional resource to get summarization of positive and negative events on the Weekly data. Recommend Business to invest more in getting higher computational resources.
    • This will enable experimenting with different prompts to get better model response.
    • This will entable experimenting with different model parameters like max_token, top_p, temperature etc.
  • By providing the Summary of 3 positive events and 3 negative events, we will be able to use it as one of the parameters to predict how the stock is going to perform in the Market.
  • Business can look at the news articles and sentiments and take appropriate action to mitigate the risks associated with negative sentiments, use positive sentiments as marketing strategy.

-

Power Ahead